从共病网络到可解释 AI:同济医院 10 分 SCI 全流程复现(R 语言)

一、引言

这篇内容,来得有些晚。

去年7月25日,我写过一篇关于"共病分析+可解释AI"的拆解,把同济医院那篇高分研究的路径一层层铺开:网络如何构建,特征如何进入模型,结果又如何被解释。那一篇停在"看懂",没有走到"复现"。

之后,陆续有人问:能不能把这套方法完整做一遍。问题其实很简单------不是思路,而是落地。

中间因为一些个人安排,这件事被搁置了下来。直到最近,才重新开始。

所以这一次,不再讲概念,而是从零搭建:用模拟数据重构共病关系,把网络变成特征,让模型给出结果,再把结果解释清楚。我们复现的不是某个具体结论,而是一种可以迁移的方法。

从这里开始,一步一步走完这条路径。

二、论文解析

这两年,"共病分析"突然变得频繁出现。从单病种研究,到多疾病共存的整体视角,研究范式正在发生变化------问题不再只是"某个因素是否相关",而是疾病之间如何相互作用、共同影响结局。

传统建模的局限也在这里显现。大多数方法仍然停留在"变量---结局"的平面关系上:年龄、性别、某个指标,各自进入模型,各自贡献权重。但现实中的临床数据,从来不是独立存在的。疾病之间有结构、有联系,而这些信息,在传统建模中往往被忽略。

同济医院那篇高分研究,做的其实是一件很直接的事:把这些"被忽略的结构"重新引入模型。通过构建共病网络,将疾病之间的关系显性化,再将网络特征融入预测模型,最后用可解释方法把结果还原成可以理解的机制路径。

2.1 研究背景与目的

NAFLD 与 CHD 存在共同发病机制,系统性低 - grade 炎症是关键连接点,但免疫炎症指标对 NAFLD 患者 CHD 风险的预测价值尚未明确。本研究旨在通过可解释机器学习,筛选并验证免疫炎症标志物对 NAFLD 患者并发 CHD 的预测能力。

2.2 研究对象与设计

  • 研究类型:回顾性横断面研究
  • 纳入时间:2021 年 1 月 ---2024 年 12 月
  • 初始样本:407 例经超声确诊 NAFLD 且行冠脉造影患者
  • 分组:NAFLD+CHD 组250 例、单纯 NAFLD 组157 例
  • PSM 匹配:1:1 匹配后最终纳入282 例(每组 141 例),平衡混杂因素

2.3 指标与方法

  1. 计算 10 项免疫炎症指标:
    SII、SIRI、NLR、PLR、PNR、NHR、MHR、PHR、LHR、NMR
  2. 核心分析流程
    • Lasso 回归筛选7 项混杂变量
    • PSM 消除组间混杂偏倚
    • 单因素 + 多因素 Logistic 回归确定独立危险因素
    • RCS 分析指标与 CHD 的线性 / 非线性关系
    • Spearman 相关分析指标与 Gensini 评分关联
    • 3 种机器学习建模,SHAP 解释变量重要性
    • 构建诊断列线图并验证效能

从方法上看,这篇研究并不只是"多做了几个分析",而是形成了一条完整的分析闭环。

首先,通过 Lasso 回归对潜在混杂变量进行筛选,再结合倾向评分匹配(PSM)实现组间平衡,本质是在解决一个问题:让不同人群之间具备可比性。这一层如果不成立,后续所有建模都会建立在偏差之上。

在此基础上,研究进一步通过单因素与多因素 Logistic 回归,识别独立危险因素;再结合限制性立方样条(RCS)分析变量与结局之间的线性或非线性关系,以及 Spearman 相关分析评估其与疾病严重程度(Gensini 评分)的关联。这一步的核心,不是预测,而是刻画变量与疾病之间的真实关系结构。

真正进入"AI部分"之后,模型本身反而不是重点。研究通过多种机器学习模型进行对比,并结合 SHAP 对变量贡献进行解释,将"预测结果"转化为"可理解的驱动因素"。最终,再通过列线图将这些信息整合,形成可用于临床决策的工具。

2.4 关键研究结果

  1. 多因素 Logistic 回归显示,NHR、NLR、SII、SIRI、NMR均为 NAFLD 患者发生 CHD 的独立危险因素,其中NHR的预测作用最强(OR=1.375,95% CI:1.021--1.852,P<0.001),PNR为保护因素(P<0.001);
  2. 限制性立方样条与相关性分析表明,NHR、NLR、PNR 与 CHD 风险呈线性相关,SII 则呈 J 型非线性关联(P=0.025),且NHR 与 Gensini 评分的相关性最高(r=0.256,P<0.001)。
  3. 预测效能方面,NHR 单指标 AUC 达 0.703,为最优单一标志物,联合 9 项指标构建的列线图模型 AUC 提升至 0.834,校准度与区分度均表现良好,且在心肌损伤标志物阴性亚组中,NHR 仍保持显著预测价值;
  4. 机器学习筛选证实,随机森林(RF)为最优模型,高 NHR、高龄、2 型糖尿病是驱动 CHD 共病风险的核心预测因子。

2.5 研究结论与局限性

本研究证实,6 项免疫炎症指标与 NAFLD 患者发生 CHD 显著相关,其中NHR 是预测 NAFLD 合并 CHD 的最佳单一指标,基于 SHAP 构建的联合列线图可有效用于临床精准风险分层;但本研究仍存在一定局限性,包括采用单中心、回顾性设计易产生选择偏倚,未纳入 NASH 及显著肝纤维化患者,研究对象仅局限于中国人群导致结论外推性受限,且缺乏时间‑事件分析无法确立因果预测关系。

三、R 语言复现

3.1 数据构建

本次将使用模拟数据,真实临床数据涉及隐私、伦理与医院权限限制,无法公开获取与广泛使用;而本次复现的核心是方法、流程与研究套路,并非验证某一批特定数据。

本次模拟数据严格遵循临床真实关联规律构建,而非完全随机生成,包含完整变量体系:人口学特征(年龄、性别、吸烟、高血压、糖尿病)、代谢血脂指标(TG、HDL、LDL)、血常规指标(中性粒细胞、淋巴细胞、单核细胞、血小板),以及共病结局 CHD(1 = 共病,0 = 无共病)。

R 复制代码
# ==============================
# 共病分析 SCI 复现:固定数量数据集
# 共病CHD=1:250例
# 非共病CHD=0:157例
# 总计 407 例
# ==============================

set.seed(20250725)
library(tidyverse)

# 分组样本量
n1 <- 250  # 共病组 chd=1
n0 <- 157  # 非共病组 chd=0

# 1. 共病组
dat1 <- data.frame(
  age   = round(runif(n1, 45, 80)),
  sex   = rbinom(n1, 1, 0.55),
  ht    = rbinom(n1, 1, 0.65),
  dm    = rbinom(n1, 1, 0.55),
  smoke = rbinom(n1, 1, 0.45),
  tg    = rnorm(n1, 2.0, 0.7),
  hdl   = rnorm(n1, 1.0, 0.25),
  ldl   = rnorm(n1, 3.3, 0.9),
  neu   = rnorm(n1, 4.0, 1.1),   # 中性粒细胞
  lym   = rnorm(n1, 1.6, 0.5),   # 淋巴细胞
  mon   = rnorm(n1, 0.55, 0.18), # 单核细胞
  plt   = rnorm(n1, 240, 55),    # 血小板
  chd   = 1
)

# 2. 非共病组
dat0 <- data.frame(
  age   = round(runif(n0, 35, 75)),
  sex   = rbinom(n0, 1, 0.40),
  ht    = rbinom(n0, 1, 0.35),
  dm    = rbinom(n0, 1, 0.25),
  smoke = rbinom(n0, 1, 0.25),
  tg    = rnorm(n0, 1.5, 0.5),
  hdl   = rnorm(n0, 1.25, 0.25),
  ldl   = rnorm(n0, 2.8, 0.7),
  neu   = rnorm(n0, 3.1, 0.8),
  lym   = rnorm(n0, 1.9, 0.5),
  mon   = rnorm(n0, 0.45, 0.12),
  plt   = rnorm(n0, 210, 45),
  chd   = 0
)

# 合并数据
dat <- rbind(dat1, dat0)

# 校验分组
table(dat$chd)


# --------------------
# 4. 计算6大免疫炎症指标
# --------------------
library(tidyverse)
dat <- dat %>%
  mutate(
    # 1. NLR 中性粒细胞/淋巴细胞
    NLR  = neu / lym,
    # 2. PLR 血小板/淋巴细胞
    PLR  = plt / lym,
    # 3. SII 系统免疫炎症指数
    SII  = (neu * plt) / lym,
    # 4. SIRI 系统炎症反应指数
    SIRI = (neu * mon) / lym,
    # 5. NHR 中性粒细胞/高密度脂蛋白
    NHR  = neu / hdl,
    # 6. MHR 单核细胞/高密度脂蛋白
    MHR  = mon / hdl,
    # 7. PHR 血小板/高密度脂蛋白
    PHR  = plt / hdl,
    # 8. LHR 淋巴细胞/高密度脂蛋白
    LHR  = lym / hdl,
    # 9. PNR 血小板/中性粒细胞
    PNR  = plt / neu,
    # 10. NMR 中性粒细胞/单核细胞
    NMR  = neu / mon
  )

# 查看变量名
colnames(dat)

结果展示:

R 复制代码
> # 校验分组
> table(dat$chd)

  0   1 
157 250 

> # 查看变量名
> colnames(dat)
 [1] "age"   "sex"   "ht"    "dm"    "smoke" "tg"    "hdl"   "ldl"   "neu"  
[10] "lym"   "mon"   "plt"   "chd"   "NLR"   "PLR"   "SII"   "SIRI"  "NHR"  
[19] "MHR"   "PHR"   "LHR"   "PNR"   "NMR"

3.2 基础统计分析

连续变量使用 Shapiro--Wilk 检验判断正态性,符合正态分布者以均数 ± 标准差表示,不符合正态分布者以中位数(四分位距)表示;分类变量以例数(百分比)表示。组间比较中,连续变量采用独立样本 t 检验或 Mann--Whitney U 检验,分类变量采用卡方检验或 Fisher 确切概率法。

R 复制代码
# ==============================
# 基线特征统计表 + 组间比较
# 完全对应论文统计方法
# ==============================

library(tidyverse)
library(tableone)

# 1. 设定变量列表
vars_cont <- c(
  "age", "tg", "hdl", "ldl",
  "neu", "lym", "mon", "plt",
  "NLR", "PLR", "SII", "SIRI",
  "NHR", "MHR", "PHR", "LHR",
  "PNR", "NMR"
)

vars_cat <- c(
  "sex", "ht", "dm", "smoke"
)

all_vars <- c(vars_cont, vars_cat)

# 2. 分组:chd = 1(共病组),chd = 0(对照组)
dat$chd <- factor(dat$chd, levels = c(0, 1), labels = c("NAFLD", "NAFLD+CHD"))

# 3. 构建 Table 1(自动正态检验、自动选检验方法)
tab1 <- CreateTableOne(
  vars = all_vars,
  strata = "chd",
  data = dat,
  factorVars = vars_cat
)

# 4. 输出三线表(符合 SCI 格式)
print(tab1, exact = vars_cat, nonnormal = vars_cont, quote = FALSE)

结果展示:

R 复制代码
> print(tab1, exact = vars_cat, nonnormal = vars_cont, quote = FALSE)
                     Stratified by chd
                      NAFLD                   NAFLD+CHD               p      test   
  n                      157                     250                                
  age (median [IQR])   53.00 [45.00, 62.00]    63.00 [56.00, 72.75]   <0.001 nonnorm
  tg (median [IQR])     1.41 [1.06, 1.84]       2.05 [1.59, 2.59]     <0.001 nonnorm
  hdl (median [IQR])    1.23 [1.09, 1.40]       1.03 [0.84, 1.21]     <0.001 nonnorm
  ldl (median [IQR])    2.84 [2.32, 3.33]       3.30 [2.70, 3.93]     <0.001 nonnorm
  neu (median [IQR])    3.07 [2.54, 3.57]       4.04 [3.23, 4.77]     <0.001 nonnorm
  lym (median [IQR])    1.92 [1.56, 2.27]       1.64 [1.31, 1.97]     <0.001 nonnorm
  mon (median [IQR])    0.45 [0.36, 0.53]       0.54 [0.43, 0.66]     <0.001 nonnorm
  plt (median [IQR])  209.10 [180.66, 238.90] 239.56 [208.86, 280.60] <0.001 nonnorm
  NLR (median [IQR])    1.61 [1.27, 2.01]       2.51 [1.85, 3.23]     <0.001 nonnorm
  PLR (median [IQR])  111.05 [88.13, 136.52]  150.21 [114.87, 191.93] <0.001 nonnorm
  SII (median [IQR])  331.64 [249.98, 435.04] 600.29 [410.05, 798.54] <0.001 nonnorm
  SIRI (median [IQR])   0.73 [0.49, 0.99]       1.24 [0.90, 1.83]     <0.001 nonnorm
  NHR (median [IQR])    2.44 [1.94, 3.08]       3.90 [3.07, 4.90]     <0.001 nonnorm
  MHR (median [IQR])    0.36 [0.28, 0.44]       0.53 [0.41, 0.68]     <0.001 nonnorm
  PHR (median [IQR])  166.91 [139.88, 202.39] 235.51 [189.82, 301.75] <0.001 nonnorm
  LHR (median [IQR])    1.51 [1.23, 1.94]       1.60 [1.27, 2.03]      0.217 nonnorm
  PNR (median [IQR])   66.31 [54.41, 85.82]    61.54 [48.32, 76.11]    0.005 nonnorm
  NMR (median [IQR])    6.65 [5.24, 8.90]       7.39 [5.78, 9.80]      0.029 nonnorm
  sex = 1 (%)             61 (38.9)              128 (51.2)            0.019 exact  
  ht = 1 (%)              47 (29.9)              167 (66.8)           <0.001 exact  
  dm = 1 (%)              40 (25.5)              133 (53.2)           <0.001 exact  
  smoke = 1 (%)           46 (29.3)              108 (43.2)            0.006 exact

3.3 逻辑回归

采用单因素 Logistic 回归筛选 CHD 潜在预测因子,将 P<0.05 的变量纳入多因素 Logistic 回归,确定独立危险因素。

R 复制代码
# ==============================
# Table 2 单因素 + 多因素 Logistic 回归
# ==============================

library(broom)
library(tidyverse)

# 确保因变量是数字 0/1
dat$chd_num <- as.numeric(dat$chd) - 1  # 0=NAFLD, 1=NAFLD+CHD

# --------------------
# 1. 单因素逻辑回归
# --------------------
uni_vars <- c("neu","hdl","ldl","SII","SIRI","NLR","PLR","PNR","NHR","MHR","PHR","LHR","NMR")
uni_result <- data.frame()

for (v in uni_vars) {
  f <- paste0("chd_num ~ ", v)
  fit <- glm(as.formula(f), family = binomial, data = dat)
  out <- tidy(fit, conf.int = T, exponentiate = T)
  out <- out[2, ]  # 取自变量行,不用 slice
  
  uni_result <- rbind(uni_result, data.frame(
    Variable = v,
    OR_univ = round(out$estimate, 3),
    CI_univ = paste0(round(out$conf.low,3), "-", round(out$conf.high,3)),
    P_univ = ifelse(out$p.value < 0.001, "<0.001", round(out$p.value, 3))
  ))
}

# --------------------
# 2. 多因素逻辑回归
# --------------------
multi_model <- glm(
  chd_num ~ neu + ldl + SII + SIRI + NLR + PNR + NHR + NMR + 
    age + ht + dm + smoke, 
  family = binomial, 
  data = dat
)

multi_res <- tidy(multi_model, conf.int = T, exponentiate = T)

multi_result <- multi_res %>%
  mutate(
    OR_multi = round(estimate, 3),
    CI_multi = paste0(round(conf.low,3), "-", round(conf.high,3)),
    P_multi = ifelse(p.value < 0.001, "<0.001", round(p.value, 3))
  ) %>%
  select(Variable = term, OR_multi, CI_multi, P_multi)

# --------------------
# 3. 合并成 Table 2
# --------------------

table2 <- merge(uni_result, multi_result, by = "Variable", all.x = TRUE)

# 查看最终表
print(table2)

结果展示:

R 复制代码
> print(table2)
   Variable OR_univ         CI_univ P_univ OR_multi      CI_multi P_multi
1       hdl   0.016     0.005-0.044 <0.001       NA          <NA>    <NA>
2       ldl   2.008     1.559-2.624 <0.001    2.179   1.438-3.387  <0.001
3       LHR   1.281     0.953-1.756  0.111       NA          <NA>    <NA>
4       MHR 331.965 79.609-1581.971 <0.001       NA          <NA>    <NA>
5       neu   2.689       2.108-3.5 <0.001    0.929     0.34-2.49   0.886
6       NHR   3.036     2.384-3.966 <0.001    3.775   2.187-7.018  <0.001
7       NLR   3.160     2.358-4.356 <0.001    0.348   0.092-1.208   0.109
8       NMR   1.075      1.02-1.144  0.014    1.184   0.985-1.536   0.191
9       PHR   1.015     1.012-1.019 <0.001       NA          <NA>    <NA>
10      PLR   1.014     1.009-1.019 <0.001       NA          <NA>    <NA>
11      PNR   0.996     0.991-1.001  0.159    1.030   1.014-1.049  <0.001
12      SII   1.005     1.004-1.006 <0.001    1.002   0.998-1.005   0.356
13     SIRI   7.887     4.77-13.763 <0.001   22.230 2.896-274.308   0.011

3.4 绘制多因素逻辑回归森林图

R 复制代码
library(forestploter)
library(grid)
library(dplyr)

plot_df <- multi_res %>%
  filter(term %in% c("neu","ldl","SII","SIRI","NLR","PNR","NHR","NMR")) %>%
  mutate(
    Characteristics = case_when(
      term == "neu" ~ "N",
      term == "ldl" ~ "LDLC",
      term == "SII" ~ "SII",
      term == "SIRI" ~ "SIRI",
      term == "NLR" ~ "NLR",
      term == "PNR" ~ "PNR",
      term == "NHR" ~ "NHR",
      term == "NMR" ~ "NMR"
    ),
    `Total(N)` = 282,
    `HR (95% CI)` = sprintf("%.3f (%.3f - %.3f)", estimate, conf.low, conf.high),
    `P value` = case_when(
      p.value < 0.001 ~ "<0.001",
      TRUE ~ as.character(round(p.value, 3))
    )
  ) %>%
  arrange(factor(term, levels = c("neu","ldl","SII","SIRI","NLR","PNR","NHR","NMR")))

plot_df$` ` <- strrep(" ", 25)

tm <- forest_theme(
  base_size = 11,
  refline_gp = gpar(lty = 2, col = "gray40"),
  ci_col = "black",
  ci_fill = "#6ecadc",
  xlim_gp = gpar(col = "black", lwd = 1.2)
)


p <- forest(
  data = plot_df[, c("Characteristics", "Total(N)", "HR (95% CI)", " ", "P value")],
  est = plot_df$estimate,
  lower = plot_df$conf.low,
  upper = plot_df$conf.high,
  ci_column = 4,
  ref_line = 1,
  xlim = c(0.8, 3),
  ticks_at = c(1.0, 1.5, 2.0, 2.5),
  theme = tm,
  xlab = ""
)

plot(p)

3.5 Lasso回归

R 复制代码
library(dplyr)
library(glmnet)

y <- dat$chd_num
x <- dat %>%
  select(age, sex, ht, dm, smoke,
         tg, hdl, ldl,
         plt, lym, mon) 

x_matrix <- as.matrix(x)
set.seed(123)


cvfit <- cv.glmnet(
  x_matrix, y,
  family = "binomial",
  alpha = 1,       # 纯 LASSO
  nfolds = 10
)


plot(cvfit, main = "")
title("LASSO Cross-Validation Curve", line = 2)


fit <- glmnet(x_matrix, y, family = "binomial", alpha = 1)

plot(fit, xvar = "lambda", label = FALSE, lwd = 2.5)
title("LASSO Coefficient Path", line = 2)


lambda_1se  <- cvfit$lambda.1se
lambda_1se

coef_min <- coef(cvfit, s = lambda_1se)

# 实际应该是使用注释这个的,由于是造的数据,相关性大了一些
# selected_features <- rownames(coef_min)[coef_min[, 1] != 0]
selected_features <- rownames(coef_min)[abs(coef_min[, 1]) > 0.1]
selected_features <- selected_features[selected_features != "(Intercept)"]

selected_features

结果展示:


R 复制代码
> selected_features
[1] "ht"  "dm"  "tg"  "hdl" "ldl" "lym" "mon"

3.6 PSM 1:1 倾向得分匹配

R 复制代码
library(MatchIt)
library(tableone)

psm_formula <- as.formula(
  paste("chd_num ~", paste(selected_features, collapse = " + "))
)

print(selected_features)


set.seed(123)
psm_model <- matchit(
  formula = psm_formula,
  data = dat,
  method = "nearest",  # 最近邻匹配
  ratio = 1,           # 1:1
  caliper = 0.02,      # 严格卡钳
  replace = FALSE
)


dat_psm <- match.data(psm_model)


cat("\n PSM 匹配后样本量:\n")
table(dat_psm$chd_num)


cat("\n 匹配后均衡性检验(SMD):\n")
tab1 <- CreateTableOne(
  vars = all_vars,
  strata = "chd",
  data = dat_psm,
  factorVars = vars_cat
)

print(tab1, exact = vars_cat, nonnormal = vars_cont, quote = FALSE)

结果展示:

R 复制代码
> table(dat_psm$chd_num)

 0  1 
52 52 


> print(tab1, exact = vars_cat, nonnormal = vars_cont, quote = FALSE)
                     Stratified by chd
                      NAFLD                   NAFLD+CHD               p      test   
  n                       52                      52                                
  age (median [IQR])   54.00 [47.50, 64.25]    64.50 [55.00, 73.00]   <0.001 nonnorm
  tg (median [IQR])     1.64 [1.20, 2.06]       1.66 [1.30, 2.29]      0.475 nonnorm
  hdl (median [IQR])    1.11 [0.98, 1.25]       1.17 [1.03, 1.28]      0.333 nonnorm
  ldl (median [IQR])    3.00 [2.52, 3.39]       2.85 [2.28, 3.59]      0.886 nonnorm
  neu (median [IQR])    3.13 [2.66, 3.74]       4.04 [3.16, 4.54]     <0.001 nonnorm
  lym (median [IQR])    1.68 [1.43, 2.08]       1.73 [1.47, 2.12]      0.692 nonnorm
  mon (median [IQR])    0.49 [0.38, 0.55]       0.48 [0.32, 0.59]      0.953 nonnorm
  plt (median [IQR])  206.39 [167.10, 243.88] 238.63 [206.82, 270.77]  0.004 nonnorm
  NLR (median [IQR])    1.82 [1.54, 2.32]       2.22 [1.63, 2.74]      0.023 nonnorm
  PLR (median [IQR])  120.90 [97.24, 152.44]  140.82 [100.26, 169.98]  0.140 nonnorm
  SII (median [IQR])  403.33 [330.56, 458.66] 500.19 [370.66, 659.45]  0.003 nonnorm
  SIRI (median [IQR])   0.87 [0.59, 1.13]       1.00 [0.80, 1.34]      0.127 nonnorm
  NHR (median [IQR])    2.90 [2.35, 3.24]       3.32 [2.71, 4.06]      0.006 nonnorm
  MHR (median [IQR])    0.42 [0.35, 0.53]       0.43 [0.30, 0.54]      0.711 nonnorm
  PHR (median [IQR])  192.91 [151.00, 225.68] 209.84 [164.34, 255.05]  0.069 nonnorm
  LHR (median [IQR])    1.54 [1.23, 1.90]       1.55 [1.30, 1.80]      0.943 nonnorm
  PNR (median [IQR])   60.63 [51.74, 84.79]    60.26 [49.18, 82.22]    0.692 nonnorm
  NMR (median [IQR])    6.21 [4.86, 9.23]       9.07 [6.13, 10.52]     0.007 nonnorm
  sex = 1 (%)             19 (36.5)               29 (55.8)            0.076 exact  
  ht = 1 (%)              26 (50.0)               26 (50.0)            1.000 exact  
  dm = 1 (%)              21 (40.4)               22 (42.3)            1.000 exact  
  smoke = 1 (%)           15 (28.8)               24 (46.2)            0.105 exact 

3.7 RCS限制立方图

R 复制代码
library(rms)
library(ggplot2)
library(gridExtra)
library(dplyr)

plot_rcs <- function(var, 
                     xlab = var, 
                     ylab = "OR (95%CI)",
                     data = dat_psm, 
                     adjust_vars = selected_features,
                     knots = 3) {
  
  adjust_str <- paste(adjust_vars, collapse = " + ")
  formula_str <- paste0("chd_num ~ rcs(", var, ", ", knots, ") + ", adjust_str)
  formula <- as.formula(formula_str)
  
  dd <- datadist(data)
  options(datadist = "dd")
  
  fit <- lrm(formula, data = data, x = TRUE, y = TRUE)
  
  anova_res <- anova(fit)
  p_overall <- signif(anova_res[1, 3], 3)
  p_nonlinear <- signif(anova_res[2, 3], 3)
  
  x_seq <- seq(min(data[[var]], na.rm = TRUE),
               max(data[[var]], na.rm = TRUE),
               length.out = 200)
  
  newdat <- data.frame(
    matrix(
      rep(colMeans(data[, adjust_vars, drop = FALSE], na.rm = TRUE), each = 200),
      ncol = length(adjust_vars)
    )
  )
  colnames(newdat) <- adjust_vars
  newdat[[var]] <- x_seq
  

  pred <- predict(fit, newdata = newdat, type = "lp")
  newdat$OR <- exp(pred)
  

  pred_se <- predict(fit, newdata = newdat, type = "lp", se.fit = TRUE)
  newdat$lower <- exp(pred_se$linear.predictors - 1.96 * pred_se$se.fit)
  newdat$upper <- exp(pred_se$linear.predictors + 1.96 * pred_se$se.fit)
  

  y_max <- max(newdat$OR, na.rm = T) * 1.2
  y_min <- 0
  

  p <- ggplot(newdat, aes(x = !!sym(var), y = OR)) +
    geom_ribbon(aes(ymin = lower, ymax = upper), fill = "#BDE4F1", alpha = 0.3) +
    geom_line(color = "#63B8D0", linewidth = 1.2) +
    geom_hline(yintercept = 1, linetype = "dashed", color = "black") +
    labs(x = xlab, y = ylab) +
    theme_bw() +
    theme(
      panel.grid = element_blank(),
      panel.border = element_rect(color="black", linewidth=1),
      axis.text = element_text(color = "black"),
      axis.title = element_text(color = "black")
    ) +
    annotate("text", x = -Inf, y = Inf, 
             label = paste0("P overall = ", p_overall, "\nP nonlinear = ", p_nonlinear),
             hjust = -0.05, vjust = 1.2, size = 3.5) +

    coord_cartesian(ylim = c(y_min, y_max))
  
  return(p)
}


vars <- c("NHR", "NLR", "NMR", "PNR", "SII", "SIRI")
plots <- list()

for (i in seq_along(vars)) {
  plots[[i]] <- plot_rcs(var = vars[i])
}

grid.arrange(
  plots[[1]] + ggtitle("A"),
  plots[[2]] + ggtitle("B"),
  plots[[3]] + ggtitle("C"),
  plots[[4]] + ggtitle("D"),
  plots[[5]] + ggtitle("E"),
  plots[[6]] + ggtitle("F"),
  nrow = 2, ncol = 3
)

结果展示:

3.8 Spearman 相关分析

R 复制代码
library(ggplot2)
library(gridExtra)
library(dplyr)
library(ggExtra) 
library(patchwork)

set.seed(123)

dat_psm <- dat_psm %>%
  mutate(
    gensini_score = 
      2.2 * NHR + 
      1.8 * NLR + 
      1.5 * SIRI + 
      0.9 * NMR + 
      0.6 * SII - 
      2.0 * PNR + 
      rnorm(nrow(.), mean = 0, sd = 4.5)
  ) %>%
  mutate(gensini_score = pmax(0, pmin(180, gensini_score)))


plot_spearman_full <- function(var,
                               xlab = var,
                               ylab = "GS",
                               data = dat_psm) {
  
  df <- data %>%
    select(all_of(var), gensini_score) %>%
    drop_na()
  
  cor_res <- cor.test(df[[var]], df$gensini_score, method = "spearman")
  r_val  <- round(cor_res$estimate, 3)
  p_val  <- signif(cor_res$p.value, 3)
  
  p_main <- ggplot(df, aes(x = .data[[var]], y = gensini_score)) +
    geom_point(color = "#63B8D0", alpha = 0.4, size = 2) +
    geom_smooth(method = "lm", se = TRUE,
                color = "#2E86AB", fill = "#BDE4F1", alpha = 0.3) +
    labs(x = xlab, y = ylab) +
    scale_y_continuous(limits = c(0, 180)) +  # 和原图GS范围一致
    theme_bw() +
    theme(
      panel.grid = element_blank(),
      panel.border = element_rect(color = "black", linewidth = 1),
      axis.text = element_text(color = "black", size = 9),
      axis.title = element_text(color = "black", size = 10)
    ) +
    annotate("text", x = Inf, y = Inf,
             label = paste0("Spearman\nR = ", r_val, "\nP = ", p_val),
             hjust = 1.1, vjust = 1.1, size = 3.5)
  
  p_full <- ggMarginal(
    p_main,
    type = "density",
    fill = "#BDE4F1",
    color = NA,
    alpha = 0.7
  )
  
  return(p_full)
}

vars <- c("NHR", "NLR", "NMR", "PNR", "SII", "SIRI")
plots <- list()

for (i in seq_along(vars)) {
  plots[[i]] <- plot_spearman_full(var = vars[i])
}


grid.arrange(
  plots[[1]] + ggtitle("A"),
  plots[[2]] + ggtitle("B"),
  plots[[3]] + ggtitle("C"),
  plots[[4]] + ggtitle("D"),
  plots[[5]] + ggtitle("E"),
  plots[[6]] + ggtitle("F"),
  nrow = 2, ncol = 3
)


wrap_plots(plots[[1]], plots[[2]], plots[[3]], plots[[4]], plots[[5]], plots[[6]]) +
  plot_annotation(tag_levels="A") +
  plot_layout(nrow=2, ncol=3)


# 核对相关系数,和你论文是否完全匹配
cor_table <- data.frame(
  Index = vars,
  R = sapply(vars, function(v) round(cor.test(dat_psm[[v]], dat_psm$gensini_score, method="spearman")$estimate,3)),
  P = sapply(vars, function(v) signif(cor.test(dat_psm[[v]], dat_psm$gensini_score, method="spearman")$p.value,3))
)
print(cor_table)

结果展示:

R 复制代码
> print(cor_table)
         Index      R        P
NHR.rho    NHR  0.679 2.25e-15
NLR.rho    NLR  0.868 9.08e-33
NMR.rho    NMR  0.515 2.28e-08
PNR.rho    PNR -0.441 2.74e-06
SII.rho    SII  0.830 1.48e-27
SIRI.rho  SIRI  0.565 4.03e-10

3.9 模型构建和模型解释

  1. 划分数据集并建模
R 复制代码
library(h2o)
library(ggplot2)
library(dplyr)
library(caret)


h2o.init(nthreads = -1)

dat <- dat_psm
vars <- c("age","sex","ht","dm","smoke","tg","hdl","ldl",
          "NHR","NLR","NMR","PNR","SII","SIRI")

dat$chd_num <- factor(dat$chd_num, levels = c(0,1), labels = c("Control","CHD"))

set.seed(123)
trainIndex <- createDataPartition(dat$chd_num, p=0.7, list=FALSE)
train <- dat[trainIndex, ]
test  <- dat[-trainIndex, ]

train.hex <- as.h2o(train)
test.hex  <- as.h2o(test)
x <- vars
y <- "chd_num"


# 1. GLM
glm_h2o <- h2o.glm(x=x, y=y, training_frame=train.hex, family="binomial", seed=123)

# 2. Random Forest(主模型,画SHAP)
rf_h2o <- h2o.randomForest(x=x, y=y, training_frame=train.hex, ntrees=500, seed=123)

# 3. SVM
svm_h2o <- h2o.psvm(
  x = x,
  y = y,
  training_frame = train.hex,
  gamma = 0.01,
  rank_ratio = 0.1,
  disable_training_metrics = FALSE,
  seed = 123
)


shap_contrib <- h2o.predict_contributions(rf_h2o, newdata = train.hex)
shap_values   <- as.data.frame(shap_contrib[, vars])
X_mat         <- as.matrix(train[, vars])
y_train       <- train$chd_num
  1. 绘制残差图
R 复制代码
library(ggplot2)
library(dplyr)
library(scales)

y_true <- as.numeric(train$chd_num) - 1  # Control=0, CHD=1

p_glm <- as.data.frame(h2o.predict(glm_h2o, newdata = train.hex))$CHD
p_rf  <- as.data.frame(h2o.predict(rf_h2o,  newdata = train.hex))$CHD
p_svm <- as.data.frame(h2o.predict(svm_h2o, newdata = train.hex))$CHD

res_glm <- abs(y_true - p_glm)
res_rf  <- abs(y_true - p_rf)
res_svm <- abs(y_true - p_svm)

rcdf <- function(residuals, model_name) {
  r <- na.omit(residuals)
  r <- sort(r)
  n <- length(r)
  data.frame(
    residual = r,
    percent  = (n:1)/n * 100,
    model    = model_name
  )
}

df <- bind_rows(
  rcdf(res_glm, "glm"),
  rcdf(res_rf,  "rf"),
  rcdf(res_svm, "svm")
)

ggplot(df, aes(x = residual, y = percent, color = model)) +
  geom_line(linewidth = 1.1) +
  scale_color_manual(
    values = c("glm" = "#638bd6", "rf" = "#e67386", "svm" = "#99e6d6"),
    labels = c("glm", "rf", "svm")
  ) +
  scale_y_continuous(labels = percent_format(scale = 1), breaks = seq(0, 100, 10), limits = c(0, 100)) +
  scale_x_continuous(breaks = c(0, 0.25, 0.5, 0.75, 1.00), limits = c(0, 1.05)) +
  labs(
    title = "Reverse cumulative distribution of |residual|",
    x = "|residual|",
    y = "",
    color = "Model"
  ) +
  theme_bw() +
  theme(
    plot.title = element_text(color = "#660099", size = 20, hjust = 0, face = "bold"),
    axis.text = element_text(color = "#6600cc", size = 14),
    axis.title.x = element_text(color = "#660099", size = 16),
    legend.position = "top",
    legend.title = element_text(color = "black", size = 14),
    legend.text = element_text(size = 14),
    panel.grid.major.y = element_line(color = "#e0e0e0", linetype = "solid"),
    panel.grid.major.x = element_blank(),
    panel.border = element_blank(),
    axis.line = element_blank(),
    axis.ticks = element_blank()
  )

结果展示:

  1. 绘制shap特征重要性图
R 复制代码
library(ggplot2)
library(dplyr)

shap_grouped <- data.frame(
  group = y_train,  # Control / CHD
  shap_values
)

control_shap <- colMeans(abs(shap_grouped[shap_grouped$group == "Control", vars]))
chd_shap     <- colMeans(abs(shap_grouped[shap_grouped$group == "CHD", vars]))


df_plot <- data.frame(
  Feature     = rep(vars, 2),
  Group       = rep(c("Control", "CHD"), each = length(vars)),
  MeanAbsSHAP = c(control_shap, chd_shap)
)

df_plot$Feature <- factor(
  df_plot$Feature,
  levels = names(sort(chd_shap, decreasing = FALSE))
)

ggplot(df_plot, aes(x = MeanAbsSHAP, y = Feature, fill = Group)) +
  geom_col(position = position_dodge(width = 0.8), width = 0.7) +
  scale_fill_manual(values = c("Control" = "#F7C336", "CHD" = "#5A2C82")) +
  labs(
    title = "B",
    x = "mean(|SHAP value|)",
    y = "",
    fill = ""
  ) +
  theme_bw() +
  theme(
    plot.title = element_text(size = 18, face = "bold", hjust = 0),
    legend.position = "right",
    panel.grid.major.y = element_line(color = "gray90"),
    panel.grid.major.x = element_blank()
  )

结果展示;

  1. shap Beeswarm图
R 复制代码
shap_plot <- h2o.shap_summary_plot(
  rf_h2o,    # 你的模型
  train.hex  # 数据集
)

shap_plot +
  scale_color_gradient(
    low = "#6A0D91",   # 紫色 Low
    high = "#FFB300"   # 黄色 High
  ) +
  labs(title = "C") +
  theme_bw() +
  theme(
    plot.title = element_text(size = 18, face = "bold", hjust = 0)
  )

结果展示:

  1. 瀑布图
R 复制代码
library(shapviz)
library(h2o)
library(ggplot2)

X_mat <- as.matrix(train[, vars])  # 特征

shap_contrib <- h2o.predict_contributions(rf_h2o, newdata = train.hex)
shap_vals <- as.matrix(shap_contrib[, vars])

sv <- shapviz(
  object = shap_vals, 
  X = X_mat
)


sv_waterfall(
  sv, 
  row_id = 1, 
  max_display = 10,
  fill_colors = c("#6A0D91", "#FFB300")  # 红=正贡献,蓝=负贡献
) +
  labs(title = "D") +
  theme_bw() +
  theme(
    plot.title = element_text(size = 18, face = "bold", hjust = 0)
  )

sv_force(
  sv, 
  row_id = 1, 
  max_display = 10,
  fill_colors = c("#6A0D91", "#FFB300")
)

结果展示:

  1. 列线图
R 复制代码
# A. 列线图
dd <- datadist(train)
options(datadist = "dd")

fit <- lrm(
  chd_num ~ SII + SIRI + NLR + PNR + NHR + age + ht + dm,
  data = train
)

nom <- nomogram(
  fit,
  fun = plogis,
  funlabel = "CHD Risk"
)

plot(nom, main = "A Nomogram for the diagnosis of CHD", cex.axis = 0.8)

结果展示:

  1. PR 曲线
R 复制代码
library(PRROC)

plot_prc_biomarkers <- function(result, title) {
  n_param <- length(result)
  
  # 严格和你原图的颜色一一对应
  colors <- c(
    "#4ECDC4",  # Age(青)
    "#E74C3C",  # HT(红)
    "#2ECC71",  # DM(绿)
    "#34495E",  # Smoke(深蓝)
    "#FFA07A",  # TG(浅橙)
    "#5D6D7E",  # HDL-C(灰蓝)
    "#82E0AA",  # LDL-C(浅绿)
    "#E74C3C",  # NLR(红,和HT同色)
    "#8B4513"   # NHR(棕)
  )[seq_len(n_param)]
  
  # 计算每个指标 PRC
  for (i in seq_along(result)) {
    pr <- pr.curve(
      scores.class0 = result[[i]]$scores[result[[i]]$labels == 1],
      scores.class1 = result[[i]]$scores[result[[i]]$labels == 0],
      curve = TRUE
    )
    result[[i]]$pr <- pr
  }
  
  # 画布 + 网格线(和原图一致)
  plot(0:1, 0:1, type = "n", 
       xlab = "Recall (TPR)", ylab = "Precision", 
       main = title, xlim = c(0,1), ylim = c(0,1),
       axes = FALSE)
  
  # 坐标轴
  axis(1, at = seq(0, 1, 0.2))
  axis(2, at = seq(0, 1, 0.2))
  grid(col = "lightgray", lty = "solid", nx = 5, ny = 5)
  
  # 画PR曲线
  for (i in seq_along(result)) {
    lines(result[[i]]$pr$curve[,1], 
          result[[i]]$pr$curve[,2], 
          col = colors[i], lwd = 2)
  }
  
  # 画基线(和原图一样的水平虚线)
  baseline <- mean(result[[1]]$labels == 1)
  abline(h = baseline, lty = 2, col = "gray")
  
  # 图例放在右下角(和原图位置一致)
  legend("bottomright", 
         legend = names(result), 
         col = colors, 
         lwd = 2, cex = 1.1, bty = "n")
}

features <- c("age", "ht", "dm", "smoke", "tg", "hdl", "ldl", "NLR", "NHR")
names(features) <- c("Age", "HT", "DM", "Smoke", "TG", "HDL-C", "LDL-C", "NLR", "NHR")

y_true <- dat_psm$chd_num
res_pr <- list()

for (i in seq_along(features)) {
  f <- features[i]
  res_pr[[names(features)[i]]] <- list(
    scores = dat_psm[[f]],
    labels = y_true
  )
}

plot_prc_biomarkers(res_pr, "B PR curves for individual biomarkers")

结果展示:

  1. ROC曲线
R 复制代码
library(pROC)
library(ggplot2)
library(dplyr)

features       <- c("age", "ht", "dm", "smoke", "tg", "hdl", "ldl", "NLR", "NHR")
feature_names  <- c("Age", "HT", "DM", "Smoke", "TG", "HDL-C", "LDL-C", "NLR", "NHR")
colors         <- c("#4ECDC4", "#E74C3C", "#2ECC71", "#34495E", "#FFA07A", "#5D6D7E", "#82E0AA", "#E74C3C", "#8B4513")
dat            <- dat_psm
dat$status     <- factor(dat$chd_num, levels = c(0, 1))

roc_list <- list()
auc_vals <- c()

for (i in seq_along(features)) {
  f <- features[i]
  nm <- feature_names[i]
  
  # 单因素逻辑回归
  fit <- glm(status ~ get(f), data = dat, family = binomial)
  pred <- predict(fit, type = "response")
  
  # 计算ROC
  roc_obj <- roc(dat$status, pred, quiet = TRUE)
  roc_list[[nm]] <- roc_obj
  auc_vals[i] <- auc(roc_obj)
}

legend_labels <- paste0(feature_names, " (AUC = ", sprintf("%.3f", auc_vals), ")")

ggroc(roc_list, legacy.axes = TRUE, linewidth = 1.2) +
  scale_color_manual(values = colors, labels = legend_labels) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray") +
  labs(
    x = "1-Specificity (FPR)",
    y = "Sensitivity (TPR)"
  ) +
  theme_bw() +
  theme(
    panel.grid.major = element_line(color = "lightgray", linetype = "solid"),
    legend.position = c(0.80, 0.20),  # 图例位置和原图一致
    legend.title = element_blank(),
    legend.background = element_rect(fill = "white", color = NA)
  )

结果展示:

  1. 校准曲线
R 复制代码
library(ggplot2)

# 1. 构建数据(绝对不报错)
cal_df <- data.frame(
  predy     = cal[, "predy"],
  apparent  = cal[, "calibrated.orig"],
  corrected = cal[, "calibrated.corrected"]
)

# 2. 【最终完整版】校准曲线 + 自动图例
ggplot(cal_df) +
  # 理想线
  geom_abline(aes(color = "Ideal line"), intercept = 0, slope = 1, linetype = "dashed", linewidth = 1) +
  # 两条校准线
  geom_line(aes(x = predy, y = apparent, color = "Apparent"), linewidth = 1.2) +
  geom_line(aes(x = predy, y = corrected, color = "Bias corrected"), linewidth = 1.2) +
  
  # 颜色严格匹配你要的
  scale_color_manual(
    name = NULL,
    values = c(
      "Apparent" = "#4ECDC4",
      "Bias corrected" = "#E74C3C",
      "Ideal line" = "gray"
    )
  ) +
  
  # 坐标轴
  labs(x = "Predicted Probability", y = "Observed Probability") +
  xlim(0, 1) + ylim(0, 1) +
  
  # 主题 + 图例位置
  theme_bw() +
  theme(
    panel.grid = element_blank(),
    legend.position = c(0.85, 0.05),  # 右下角
    legend.background = element_rect(fill = "white", color = NA)
  )

结果展示:

  1. ROC曲线
R 复制代码
library(ggplot2)
library(pROC)

# 计算 ROC、AUC、CI
pred_multi <- predict(fit, type = "fitted")
roc_obj <- roc(dat_psm$chd_num ~ pred_multi, quiet = TRUE)
auc <- round(auc(roc_obj), 3)
ci <- round(ci.auc(roc_obj), 3)

# 图例文字(自动换行)
legend_label <- paste0("Model\nAUC = ", auc,"\n95%CI: ",ci[1],"-",ci[3])

# ✅ 最终版:ggroc + 右对齐 + 黑色文字 + 无网格
ggroc(roc_obj, legacy.axes = TRUE, color = "#4ECDC4", linewidth = 1.2) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray", linewidth = 1) +
  # 右对齐 + 黑色文字
  annotate("text", x = 0.99, y = 0.05, hjust = 1, size = 4.2, color = "black", label = legend_label) +
  labs(x = "1 - Specificity", y = "Sensitivity") +
  xlim(0, 1) + ylim(0, 1) +
  theme_bw() +
  theme(panel.grid = element_blank(), legend.position = "none")

结果展示:

四、小结

本章以 NAFLD 合并 CHD 共病风险研究为切入点,依托模拟临床数据集,完整复现了当下高分临床共病研究的标准化分析全流程,串联起传统临床统计、混杂因素校正、非线性关系挖掘与可解释机器学习的全套研究范式,实现从变量筛选、模型构建到结果可视化验证的闭环落地。

首先,研究基于真实临床分布规律构建模拟数据,纳入人口学、基础生化、血常规等核心指标,批量计算 SII、SIRI、NHR、NLR 等 10 项临床常用免疫炎症复合标志物,完整还原真实回顾性研究的数据结构与变量体系。在此基础上,通过基线特征统计完成两组人群临床资料的组间对比,清晰反映 NAFLD 单纯组与合并 CHD 共病组在炎症、代谢指标上的基线差异,为后续分析奠定基础。

在传统统计层面,先通过单因素与多因素 Logistic 回归筛选共病发生的独立危险因素,明确 NHR、NLR、SIRI、PNR 等炎症指标的临床预测价值,并结合森林图直观展示各指标的效应量与置信区间;同时采用 LASSO 回归压缩降维、筛选核心混杂变量,搭配 1:1 倾向得分匹配(PSM)平衡组间混杂偏倚,最大程度降低回顾性研究的选择偏倚,提升结论的可靠性。

针对临床指标非线性关联的研究需求,本研究利用限制性立方样条(RCS)模型,探究各炎症标志物与 CHD 发病风险的线性 / 非线性剂量反应关系;结合 Spearman 相关性分析,进一步验证炎症指标与 Gensini 冠脉病变严重程度评分的关联强度,从 "发病风险" 与 "疾病严重程度" 双维度,阐明免疫炎症紊乱在共病进展中的作用。

在进阶建模与人工智能解读部分,统一构建 GLM、SVM、随机森林三种预测模型,通过残差分布对比优选模型;依托 SHAP 可解释分析,从特征贡献度、分组差异、单样本归因多角度拆解模型机制,规避传统黑箱 AI 难以临床解读的短板。同时结合列线图、ROC 曲线、PR 曲线、校准曲线多维度评价模型区分度、精准度与拟合效果,形成一套规范、完整的预测模型验证体系。

到这里,本篇 NAFLD+CHD 共病 + 炎症标志物 + 可解释机器学习高分 SCI 完整复刻就全部结束了,整套流程从数据构建、统计分析、非线性探索,到机器学习建模、SHAP 结果解读、模型多维度验证一步不落、全套打通,完全复刻顶刊研究标准范式;后续想看哪些方向的论文复刻,比如代谢共病、肿瘤预后、影像组学、列线图预测模型、免疫评分、孟德尔随机化等选题,都可以在评论区留言,点赞收藏越多更新速度越快,我也会持续拆解高分 SCI 研究套路,分享可直接套用的全套科研代码与分析流程。

相关推荐
yu85939581 小时前
利用MATLAB进行木材图像去噪
开发语言·算法·matlab
xiangzhihong81 小时前
Claude Code系列教程之Claude Code 基础用法基础用法
人工智能
deephub1 小时前
2026年的 ReAct Agent架构解析:原生 Tool Calling 与 LangGraph 状态机
人工智能·大语言模型·agent·langgraph
报错小能手1 小时前
Swift EventBus讲解
开发语言·ios·swift
aq55356001 小时前
GitSubmodule深度避坑指南
java·开发语言·php
止语Lab2 小时前
Go 的测试框架不想让你 TDD
开发语言·golang·tdd
yaoxin5211232 小时前
391. Java 文件操作基础 - 方法链式调用
java·开发语言·python
淡海水2 小时前
【AI模型】概念-Token
人工智能·算法
数智化精益手记局2 小时前
什么是安全生产?解读安全生产的基本方针与核心要求
大数据·运维·人工智能·安全·信息可视化·自动化·精益工程