禁止商业或二改转载,仅供自学使用,侵权必究,如需截取部分内容请后台联系作者!
介绍
宏基因组数据的荟萃分析是一种综合多个独立宏基因组研究结果的方法,目的是揭示不同人群或样本中微生物群落的共同特征和差异。这种方法特别适用于跨区域、跨人群的大规模比较研究,能够帮助科学家们识别与特定健康状况或环境因素相关的微生物标志物。
meta
包中的 metagen
函数用于进行宏基因组数据的荟萃分析,其核心原理是综合多个独立研究的结果,以评估不同组别间在微生物群落组成上的差异性,并得出更加全面和可靠的结论。以下是该函数进行荟萃分析的一般原理:
- 数据整合:将不同研究的数据集整合在一起。这些数据集可能来自不同的样本、人群或环境条件,但都关注相似的生物学问题。
- 效应量计算:对于每个研究,计算效应量(Effect Size),这通常表示为组间差异的度量,如对数比值(Log Ratio)或标准化均值差。
- 权重分配:根据每个研究的样本大小、效应量估计的变异性和其他统计特性分配权重。较大的权重通常给予那些样本量大、估计更精确的研究。
- 异质性评估:评估不同研究结果之间的异质性,即研究结果差异是否超出了随机误差的预期。这可以通过I²统计量或Q统计量来完成。
- 固定效应和随机效应模型:根据异质性的大小,选择使用固定效应模型(假设所有研究共享相同的效应量)或随机效应模型(允许不同研究有不同的效应量)。
- 荟萃分析结果的合并:使用加权平均或基于模型的方法将不同研究的效应量合并,得出综合效应量估计。
- 置信区间和显著性检验:计算合并效应量的置信区间,并进行显著性检验,以评估组间差异是否具有统计学意义。
- 敏感性分析和偏倚评估:进行敏感性分析以检查荟萃分析结果对单个研究的依赖程度,以及评估潜在的发表偏倚。
加载依赖包
{r}
library(curatedMetagenomicData)
library(tidyverse)
library(SummarizedExperiment)
library(TreeSummarizedExperiment)
library(gtsummary)
library(ANCOMBC)
library(meta)
library(ggvenn)
library(forestploter)
library(grid)
数据下载
通过curatedMetagenomicData
R包下载健康人群的样本数据和肠道微生物宏基因组数据。
样本数据
通过curatedMetagenomicData
R包下载健康人群的粪便样本数据,数据满足以下标准:
-
age
>= 16 -
body_site
是粪便样本 -
study_condition
是健康样本 -
BMI
和gender
不是缺失值 -
样本采集时间符合要求
{r}
# 筛选样本
metadata <- curatedMetagenomicData::sampleMetadata %>%
dplyr::filter(age >= 16 &
body_site == "stool" &
study_condition == "control" &
is.na(BMI) != TRUE &
is.na(gender) != TRUE &
days_from_first_collection %in% c(0, NA))
# 过滤重复样本(可能同一个参与者采集多次样本)
metadata <- metadata %>%
dplyr::group_by(study_name, subject_id) %>%
dplyr::filter(row_number() == 1) %>%
dplyr::ungroup()
# 根据gender数目和比例筛选符合要求的研究
datasets_tokeep <- metadata %>%
dplyr::select(study_name, gender) %>%
dplyr::group_by(study_name) %>%
dplyr::summarise(
n_males = sum(gender=="male"),
n_females= sum(gender=="female"),
N=n()) %>%
dplyr::mutate(keep = (pmin(n_males, n_females) >= 40) &
(n_females/N >= 0.25) & (n_males/N >= 0.25)) %>%
dplyr::filter(keep == TRUE)
head(datasets_tokeep)
study_name | n_males | n_females | N | keep |
---|---|---|---|---|
AsnicarF_2021 | 306 | 791 | 1097 | TRUE |
DeFilippisF_2019 | 46 | 51 | 97 | TRUE |
DhakanDB_2019 | 43 | 46 | 89 | TRUE |
HMP_2012 | 51 | 44 | 95 | TRUE |
KeohaneDM_2020 | 51 | 65 | 116 | TRUE |
LifeLinesDeep_2016 | 474 | 661 | 1135 | TRUE |
统计样本分布情况,查看不同研究在年龄、性别和BMI的分布状态。
{r}
metadata <- metadata %>%
dplyr::filter(study_name %in% datasets_tokeep$study_name)
metadata %>%
dplyr::select(study_name, age, gender, BMI) %>%
gtsummary::tbl_summary(by = "study_name") %>%
gtsummary::modify_header(label ~ "**Characteristic**") %>%
gtsummary::add_p() %>%
gtsummary::add_overall() %>%
gtsummary::bold_labels()
肠道微生物数据
通过curatedMetagenomicData::returnSamples()
函数下载数据(数据格式是TreeSummarizedExperiment数据对象),并对数据进行预处理。选择relative_abundance
-
relative abundance(用于差异分析的linear model方法,通过
counts = FALSE
控制) -
counts abundance(用于差异分析的ANCOMBC方法,通过
counts = TRUE
控制)
{r}
if (file.exists("./data/healthStool-tse-count.RDS")) {
tse_prop <- readRDS("./data/healthStool-tse-prop.RDS")
tse_count <- readRDS("./data/healthStool-tse-count.RDS")
} else {
tse_prop <- curatedMetagenomicData::returnSamples(
sampleMetadata = metadata,
dataType = "relative_abundance",
counts = FALSE,
rownames = "short")
#Separating Asnicar USA from Asnicar GBR
colData(tse_prop)[which(colData(tse_prop)$study_name == "AsnicarF_2021" &
colData(tse_prop)$country == "USA"),]$study_name <- "USA_asnicarF_2021"
saveRDS(tse_prop, "./data/healthStool-tse-prop.RDS", compress = TRUE)
tse_count <- curatedMetagenomicData::returnSamples(
sampleMetadata = metadata,
dataType = "relative_abundance",
counts = TRUE,
rownames = "short")
#Separating Asnicar USA from Asnicar GBR
colData(tse_count)[which(colData(tse_count)$study_name == "AsnicarF_2021" &
colData(tse_count)$country == "USA"),]$study_name <- "USA_asnicarF_2021"
saveRDS(tse_count, "./data/healthStool-tse-count.RDS", compress = TRUE)
}
tse_count
class: TreeSummarizedExperiment
dim: 1109 3186
metadata(0):
assays(1): relative_abundance
rownames(1109): Phocaeicola vulgatus Bacteroides stercoris ... Oligella ureolytica Paenibacillus sp. 7884-2
rowData names(7): superkingdom phylum ... genus species
colnames(3186): SAMEA7041133 SAMEA7041134 ... wHAXPI034923-12 wHAXPI034926-15
colData names(140): study_name subject_id ... ALT eGFR
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
rowLinks: a LinkDataFrame (1109 rows)
rowTree: 1 phylo tree(s) (10430 leaves)
colLinks: NULL
colTree: NULL
线性回归分析
通过控制年龄和体重指数等协变量,构建单数据集与性别相关的微生物物种差异丰度的线性回归模型。获取该模型中微生物物种的效应值和效应值误差,它们将用于后续荟萃分析。
函数
-
d_fromlm
: 计算线性模型t值的标准偏差 -
SE_d
: 计算t值的标准偏差的标准误,该值用于荟萃分析 -
asin_trans
: relative abundance数据转换(线性模型要求数据服从正态分布) -
computeStandardizedMeanDifference
: 计算单个数据集与性别相关的微生物物种的统计结果,用于荟萃分析
{r}
# Compute Standardized Difference (d) from t-value
d_fromlm <- function(n1, n2, t) {
d <- (t*(n1+n2)/(sqrt(n1*n2)*sqrt(n1+n2-2)))
return(d)
}
# Compute Std. error of d
SE_d <- function(n1, n2, d) {
se_d <- sqrt(((n1+n2-1)/(n1+n2-3))*((4/(n1+n2))*(1+((d**2)/8))))
return(se_d)
}
# Transform relative abundances
asin_trans <- function(rel_ab){
return(asin(sqrt(rel_ab/100)))
}
# Function that takes in input a tse object and for each dataset computes a linear model for each linear species
computeStandardizedMeanDifference <- function(tse_object){
tse_datasets <- unique(SummarizedExperiment::colData(tse_object)$study_name)
singleDatasetAnalyze <- function (dataset) {
single_dataset_tse <- tse_object[,tse_object$study_name == dataset]
exprs_df <- asin_trans(SummarizedExperiment::assay(single_dataset_tse))
exprs_df <- exprs_df[rowSums(is.na(exprs_df)) != ncol(exprs_df), ]
species <- rownames(exprs_df)
age <- SummarizedExperiment::colData(single_dataset_tse)$age
bmi <- SummarizedExperiment::colData(single_dataset_tse)$BMI
gender <- as.factor(SummarizedExperiment::colData(single_dataset_tse)$gender)
compute_lm <- function(exprs_row){
lm_res <- broom::tidy(lm(exprs_row ~ bmi + age + gender))
lm_res <- tibble::column_to_rownames(lm_res, var = "term")
res <- lm_res["gendermale",c("statistic","p.value"),drop=TRUE]
return(res)
}
lmResults <- t(
sapply(species,
FUN = function(x) {
species_relabs <- exprs_df[x,]
res <- compute_lm(species_relabs)
return(res)
}))
n_gender <- c(table(gender)["male"][[1]],
table(gender)["female"][[1]])
#Compute effect size and respective standard error for each species
#in single dataset
d_List <- as.vector(sapply(lmResults[,"statistic"],
function(x) {d_fromlm(n_gender[1], n_gender[2], x)}))
SE_d_List <- as.vector(sapply(d_List,
function(x){SE_d(n_gender[1], n_gender[2], x)}))
#Wald test for relative abundance of species between males and females
#(default of lm() function)
wald_list <- as.double(lmResults[,"p.value", drop=TRUE])
#FDR-correction with Benjamini-Hochberg method for Wald p-values
q_values <- p.adjust(wald_list,method = "BH")
final_df <- as.matrix(cbind(d_List,
SE_d_List,
wald_list,
q_values))
#Finalize results for the single dataset
colnames(final_df) <- c(paste0(dataset,"_CohenD"),
paste0(dataset,"_SE_d"),
paste0(dataset,"_pvalue"),
paste0(dataset,"_Qvalue"))
rownames(final_df) <- species
return(final_df)
}
linear_models <- lapply(tse_datasets, singleDatasetAnalyze)
names(linear_models) <- tse_datasets
return(linear_models)
}
荟萃分析函数
runMetaanalysis
函数用于荟萃分析,它基于单个线性模型的结果再使用meta::metagen
进行荟萃分析。荟萃分析的结果包含所有研究的效应值(RE)和效应值的误差(SE_RE)。
{r}
runMetaanalysis <- function(d_vector, SE_d_vector) {
a <- meta::metagen(
TE = d_vector,
seTE = SE_d_vector,
studlab = rownames(d_vector),
method.tau = "PM",
sm = "SMD") # Standardized Mean Difference
final_vector <- c(
a$TE.random,
a$seTE.random,
paste(a$lower.random, a$upper.random, sep = ";"),
a$zval.random,
a$pval.random,
a$tau2,
a$I2)
names(final_vector) <- c("RE", "SE_RE", "CI_RE", "Zscore", "p-value", "tau2", "I^2")
return(final_vector)
}
运行荟萃分析
-
computeStandardizedMeanDifference
获得每个数据集的线性模型结果(CohenD 和SE_d统计量用于荟萃分析) -
合并每个数据集的差异分析结果
{r}
SDM <- computeStandardizedMeanDifference(tse_object = tse_prop)
#SDM$USA_asnicarF_2021
# 合并结果数据
final_df <- Reduce(function(x, y) {merge(x, y, all=TRUE)},
lapply(SDM, function(x){data.frame(x, rn = row.names(x))})) %>%
tibble::column_to_rownames(var = "rn")
# Run meta-analyses
d_matrix <- final_df %>%
dplyr::select(contains("CohenD")) %>%
t()
se_matrix <- final_df %>%
dplyr::select(contains("SE_d")) %>%
t()
meta_result <- t(
mapply(function(x, y){runMetaanalysis(x, y)},
as.data.frame(d_matrix),
as.data.frame(se_matrix)))
final_df <- cbind(final_df, meta_result)
final_df$FDR_Qvalue <- p.adjust(final_df$`p-value`, method = "BH")
head(final_df)
USA_asnicarF_2021_CohenD | USA_asnicarF_2021_SE_d | USA_asnicarF_2021_pvalue | |
---|---|---|---|
[Bacteroides] pectinophilus | -0.18299435 | 0.2056472 | 0.4038392 |
[Butyribacterium] methylotrophicum | 0.14592699 | 0.2054911 | 0.5053179 |
[Clostridium] cocleatum | NaN | NaN | NaN |
[Clostridium] hylemonae | -0.10582146 | 0.2053618 | 0.6288533 |
[Clostridium] innocuum | -0.15061136 | 0.2055089 | 0.4917828 |
[Clostridium] leptum | 0.07235138 | 0.2052853 | 0.7409638 |
结果:每个微生物物种对应的单个研究的效应值(CohenD)和荟萃分析的效应值(RE)。
可视化结果
采用森林图展示荟萃分析的结果,该结果包含效应值RE的95%置信区间和对应的P值。
{r}
plotdata <- final_df %>%
tibble::rownames_to_column("Species") %>%
dplyr::select(Species, RE, SE_RE, CI_RE, FDR_Qvalue) %>%
dplyr::filter(!is.na(RE)) %>%
dplyr::filter(FDR_Qvalue < 0.001) %>%
dplyr::group_by(Species) %>%
dplyr::mutate(RE_lower = unlist(strsplit(CI_RE, ";"))[1],
RE_upper = unlist(strsplit(CI_RE, ";"))[2]) %>%
dplyr::ungroup() %>%
dplyr::mutate(RE_new = round(as.numeric(RE), 2),
RE_lower = round(as.numeric(RE_lower), 2),
RE_upper = round(as.numeric(RE_upper), 2),
`HR (95% CI)` = paste0(RE_new," (", RE_lower, " to ", RE_upper, ")")) %>%
dplyr::mutate(SE_RE = as.numeric(SE_RE),
FDR_Qvalue = signif(as.numeric(FDR_Qvalue), 4))
plotdata$` ` = paste(rep(" ", nrow(plotdata)), collapse = " ")
for_theme <- forest_theme(
base_size = 10,
ci_pch = 15,
ci_col = "#762a83",
ci_fill = "black",
ci_alpha = 0.8,
ci_lty = 1,
ci_lwd = 2,
ci_Theight = 0.2,
refline_gp = gpar(lwd = 1, lty = "dashed", col = "grey20"),
vertline_gp = gpar(lwd = 1, lty = "dashed", col = "grey20"),
summary_fill = "yellow",
summary_col = "#4575b4",
footnote_gp = gpar(cex = 0.6, fontface = "italic", col = "red"))
pl <- forest(
plotdata %>%
dplyr::select(Species, ` `, `HR (95% CI)`, `FDR_Qvalue`),
est = plotdata$RE_new,
lower = plotdata$RE_lower,
upper = plotdata$RE_upper,
sizes = plotdata$SE_RE*10,
ci_column = 2,
xlim = c(-0.5, 0.4),
arrow_lab = c("Enriched in female", "Enriched in male"),
theme = for_theme)
pl
结果 :荟萃分析筛选到17种差异微生物。
ANCOMBC分析
使用ANCOMBC
方法对每个研究的gender(male vs female)进行差异分析,获得每个数据集的差异分析结果即每个物种的效应值和效应值标准误差。ANCOMBC详细分析见:GMSB文章五:微生物组差异分析ANCOMBC2。
{r}
#Function that takes in input a tse object and for each dataset performance ANCOMBC for each linear species
computeANCOMBC <- function(tse_object){
names(assays(tse_object)) <- "counts"
tse_datasets <- unique(SummarizedExperiment::colData(tse_object)$study_name)
singleDatasetAnalyze <- function (dataset) {
single_dataset_tse <- tse_object[, tse_object$study_name == dataset]
temp_res <- ANCOMBC::ancombc2(
data = single_dataset_tse,
assay_name = "counts",
fix_formula = "gender + age + BMI",
rand_formula = NULL,
p_adj_method = "BY",
prv_cut = 0.10,
lib_cut = 1000,
s0_perc = 0.05,
group = "gender",
struc_zero = TRUE,
neg_lb = TRUE,
alpha = 0.05,
n_cl = 8,
verbose = TRUE,
global = TRUE,
pairwise = TRUE,
dunnet = TRUE,
trend = TRUE,
iter_control = list(tol = 1e-2, max_iter = 20, verbose = TRUE),
em_control = list(tol = 1e-5, max_iter = 100),
lme_control = lme4::lmerControl(),
mdfdr_control = list(fwer_ctrl_method = "holm", B = 100),
trend_control = list(contrast = list(matrix(c(1, 0, 0,
-1, 1, 0,
0, -1, 1),
nrow = 3, byrow = TRUE),
matrix(c(-1, 0, 0,
1, -1, 0,
0, 1, -1),
nrow = 3, byrow = TRUE)),
node = list(3, 3),
solver = "ECOS",
B = 1000))
# ANCOMBC结果
res <- temp_res$res %>%
dplyr::select(taxon, lfc_gendermale, se_gendermale,
W_gendermale, p_gendermale, q_gendermale,
diff_gendermale, passed_ss_gendermale)
colnames(res) <- gsub("gendermale", dataset, colnames(res))
return(res)
}
ANCOMBC_res_list <- lapply(tse_datasets, singleDatasetAnalyze)
names(ANCOMBC_res_list) <- tse_datasets
return(ANCOMBC_res_list)
}
运行荟萃分析
-
computeANCOMBC
获得每个数据集的线性模型结果(lfc 和SE统计量用于荟萃分析) -
lfc可以认为是Standardized Mean Difference
-
合并每个数据集的差异分析结果
{r}
ANCOMBC_res <- computeANCOMBC(tse_object = tse_count)
# 合并结果数据
ANCOMBC_final_df <- Reduce(function(x, y) {merge(x, y, all=TRUE)},
lapply(ANCOMBC_res, function(x){data.frame(x)})) %>%
tibble::column_to_rownames(var = "taxon")
# Run meta-analyses
ANCOMBC_lfc_matrix <- ANCOMBC_final_df %>%
dplyr::select(starts_with("lfc")) %>%
t()
ANCOMBC_se_matrix <- ANCOMBC_final_df %>%
dplyr::select(starts_with("se")) %>%
t()
ANCOMBC_meta_result <- t(
mapply(function(x, y){runMetaanalysis(x, y)},
as.data.frame(ANCOMBC_lfc_matrix),
as.data.frame(ANCOMBC_se_matrix)))
ANCOMBC_final_df <- cbind(ANCOMBC_final_df, ANCOMBC_meta_result)
head(ANCOMBC_final_df)
lfc_USA_asnicarF_2021 | se_USA_asnicarF_2021 | W_USA_asnicarF_2021 | p_USA_asnicarF_2021 | |
---|---|---|---|---|
[Clostridium] innocuum | 0.06662334 | 0.2420969 | 0.27519291 | 0.7856251 |
[Clostridium] leptum | -0.02685778 | 0.3878781 | -0.06924285 | 0.9449966 |
[Clostridium] scindens | NA | NA | NA | NA |
[Clostridium] spiroforme | 0.37680978 | 0.2465282 | 1.52846507 | 0.1324578 |
[Clostridium] symbiosum | -0.08672899 | 0.2324258 | -0.37314699 | 0.7112896 |
[Eubacterium] rectale | 0.38639545 | 0.3502448 | 1.10321547 | 0.2730472 |
结果:每个微生物物种对应的单个研究的效应值(lfc)和荟萃分析的效应值(RE)。
可视化结果
采用森林图展示结果,该结果包含效应值RE的95%置信区间和对应的P值。
{r}
ANCOMBC_plotdata <- ANCOMBC_final_df %>%
tibble::rownames_to_column("Species") %>%
dplyr::mutate(FDR_Qvalue = p.adjust(as.numeric(`p-value`, method = "BH"))) %>%
dplyr::select(Species, RE, SE_RE, CI_RE, FDR_Qvalue) %>%
dplyr::filter(!is.na(RE)) %>%
dplyr::filter(FDR_Qvalue < 0.001) %>%
dplyr::group_by(Species) %>%
dplyr::mutate(RE_lower = unlist(strsplit(CI_RE, ";"))[1],
RE_upper = unlist(strsplit(CI_RE, ";"))[2]) %>%
dplyr::ungroup() %>%
dplyr::mutate(RE_new = round(as.numeric(RE), 2),
RE_lower = round(as.numeric(RE_lower), 2),
RE_upper = round(as.numeric(RE_upper), 2),
`HR (95% CI)` = paste0(RE_new," (", RE_lower, " to ", RE_upper, ")")) %>%
dplyr::mutate(SE_RE = as.numeric(SE_RE),
FDR_Qvalue = signif(as.numeric(FDR_Qvalue), 4))
ANCOMBC_plotdata$` ` <- paste(rep(" ", nrow(ANCOMBC_plotdata)), collapse = " ")
ANCOMBC_pl <- forest(
ANCOMBC_plotdata %>%
dplyr::select(Species, ` `, `HR (95% CI)`, `FDR_Qvalue`),
est = ANCOMBC_plotdata$RE_new,
lower = ANCOMBC_plotdata$RE_lower,
upper = ANCOMBC_plotdata$RE_upper,
sizes = ANCOMBC_plotdata$SE_RE,
ci_column = 2,
xlim = c(-4, 3),
arrow_lab = c("Enriched in female", "Enriched in male"),
theme = for_theme)
ANCOMBC_pl
结果 :荟萃分析筛选到21种差异微生物。
比较两种方法
两种方法筛选到的共有差异物种情况。
{r}
ggvenn(data = list(LM = plotdata$Species,
ANCOMBC = ANCOMBC_plotdata$Species))
结果 :两种方法筛选到的重复差异物种仅仅只有一个Hungatella hathewayi,这提示我们在筛选差异微生物的时候选择方法的重要性。
总结
在进行微生物组数据的荟萃分析时,我们首先采用了线性模型(Linear Model)和ANCOM-BC2两种统计方法对不同研究的数据进行差异性检验。随后,利用meta::metagen
函数对这两种方法的分析结果进行了综合的荟萃分析。然而,我们观察到通过这两种方法共同筛选出的差异物种数量异常有限。这一发现突显了不同差异分析方法在荟萃分析中的重要性和影响力,表明方法学的选择可能对最终的分析结论产生显著影响。因此,当我们开展荟萃分析工作时,必须谨慎考虑所采用的统计方法,以确保分析结果的准确性和可靠性。