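Seurat V5 structure tree and basic integration pipeline

Compared with V4, Seurat V5 introduces a Layer architecture that makes multi-sample integration more flexible and efficient. The side effect is that existing R packages and functions are currently in a messy state, so I have put together some notes on the object structure tree and a basic analysis pipeline.

For the quick version, jump to sections 2.1 and 8 😜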

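1. Layer architecture in detail

1.1 What a Layer is and why it exists

In Seurat V5, each Assay (one measurement type) can hold multiple Layers. This is the core change in V5, and the rest of the V5 workflow builds on it. V4 stored each Assay's data in fixed slots (counts, data, scale.data). V5 replaces these with a more flexible Layer structure, so a single Assay can manage several separate data matrices.
The main benefit shows up with multiple samples: V5 keeps each sample's data in its own Layer instead of collapsing everything into one matrix right away. Each sample's original matrix stays intact, which is what the later batch-correction and integration steps need. For example, after reading 10 separate samples and merging them into one Seurat object, the RNA Assay contains 10 counts Layers, one expression matrix per sample.
Key properties of Layers:
- All Layers share one meta.data table;
- Layers of an Assay are indexed against the same rows (genes), while the columns (cells) differ from Layer to Layer;
- A Layer is not a new Assay; it is one of several expression matrices stored under the same Assay.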
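
The properties above can be checked on a toy object. This is a minimal sketch, assuming Seurat V5 is installed; the count matrix, the gene and cell names, and the `sample` column are all made up for illustration.

```r
library(Seurat)

set.seed(1)
# Made-up data: 50 genes x 40 cells, two hypothetical samples of 20 cells each
counts <- Matrix::Matrix(
  matrix(rpois(50 * 40, lambda = 2), nrow = 50,
         dimnames = list(paste0("gene", 1:50), paste0("cell", 1:40))),
  sparse = TRUE
)
obj <- CreateSeuratObject(counts = counts)
obj$sample <- rep(c("S1", "S2"), each = 20)

# Split the RNA assay into one counts Layer per sample
obj[["RNA"]] <- split(obj[["RNA"]], f = obj$sample)
layer_names <- Layers(obj, assay = "RNA")
print(layer_names)

# One meta.data table still covers every cell
n_meta_rows <- nrow(obj[[]])

# Each Layer keeps the genes but only its own sample's cells
dims_s1 <- dim(LayerData(obj, assay = "RNA", layer = "counts.S1"))
print(dims_s1)
```

The object itself stays a single Seurat object throughout; only the matrices inside the RNA Assay are stored per sample.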

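1.2 Example Layer layout

Assay: RNA
  ├─ counts.Sample01     ← raw count matrix of sample 1
  ├─ counts.Sample02     ← raw count matrix of sample 2
  ├─ counts.Sample03     ← raw count matrix of sample 3
  └─ ...                 ← more samples

After JoinLayers(), these per-sample Layers are combined into the standard counts and data Layers (scale.data is already a single Layer once ScaleData() has run), which simplifies downstream analysis.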

2. Structure tree of the Seurat object

2.1 Overview of the architecture

Seurat-object
├─ meta.data                              # cell / spot level annotation (main entry point for operations)
│  ├─ orig.ident
│  ├─ nCount_RNA
│  ├─ nFeature_RNA
│  ├─ nCount_Spatial
│  ├─ nFeature_Spatial
│  ├─ array_row / array_col               # Visium array coordinates
│  ├─ tissue                              # whether the spot is on tissue
│  ├─ anterior1_imagerow                  # ← projected from images
│  └─ anterior1_imagecol                  # ← prefix comes from the image key, stOBJ@images$anterior1@key
│
├─ assays                                 # ⭐ molecular counts and feature space
│  ├─ RNA                                 # ── transcriptome (scRNA / snRNA)
│  │  ├─ layers [V5: layers, no slots]    # V5: before JoinLayers(), one layer per sample
│  │  │    ├─ counts                      # raw counts,     after JoinLayers()
│  │  │    ├─ data                        # log-normalized, after JoinLayers()
│  │  │    ├─ scale.data                  # z-score,        after JoinLayers()
│  │  │    ├─ spliced / unspliced         # velocyto
│  │  │    ├─ corrected                   # SCT / Harmony
│  │  │    ┊
│  │  │    ┊ # [before JoinLayers()]
│  │  │    ├─ counts.01                   # raw counts, sample 01
│  │  │    ├─ counts.02                   # raw counts, sample 02
│  │  │    ├─ counts.0...                 # raw counts, ...
│  │  │    ├─ data.01                     # log-normalized, sample 01
│  │  │    ├─ data.02                     # log-normalized, sample 02
│  │  │    ├─ data.0...                   # log-normalized, ...
│  │  │    └─ .....                       # (and so on)
│  │  ┊
│  │  ┊ # [V4 object: RNA holds slots directly; dotted lines mark the contrast with V5]
│  │  ├─ @counts [V4: slots, no layers]   # raw counts [slot]
│  │  ├─ @data                            # log-normalized [slot]
│  │  ├─ @scale.data                      # z-score [slot]
│  │  ├─ @var.features                    # highly variable genes
│  │  └─ @meta.features                   # feature (gene) level metadata
│  │
│  ├─ Spatial                             # ── spatial transcriptomics (Visium)
│  │  └─ layers (V5)
│  │     ├─ counts                        # gene × spot
│  │     ├─ data
│  │     ├─ scale.data
│  │     └─ SCT                           # SCTransform-corrected data (SCTransform() stores it as a separate "SCT" assay)
│  │
│  ├─ ADT                                 # ── CITE-seq protein
│  │  └─ layers (V5)
│  │     ├─ counts
│  │     ├─ data
│  │     └─ scale.data
│  │
│  ├─ ATAC                                # ── chromatin accessibility
│  │  └─ layers (V5)
│  │     ├─ counts                        # peak × cell
│  │     ├─ data
│  │     ├─ scale.data
│  │     └─ TF-IDF
│  ....
│
├─ images                                 # ⭐ spatial geometry (orthogonal to assays)
│  └─ anterior1                           # slice / library_id
│     ├─ @coordinates                     # spot ↔ pixel / array mapping
│     │  ├─ imagerow / imagecol
│     │  ├─ row / col
│     │  └─ tissue
│     │
│     ├─ @image                           # tissue section image
│     │  ├─ lowres.png
│     │  ├─ hires.png
│     │  └─ scale.factors.json
│     │
│     ├─ @scale.factors                   # scaling between spots and pixels
│     │  ├─ spot
│     │  ├─ fiducial
│     │  ├─ hires
│     │  └─ lowres
│     │
│     └─ @key                             # coordinate prefix (e.g. "anterior1_")
│
├─ reductions                             # low-dimensional embeddings (molecular / spatial)
│  ├─ pca                                 # RNA / SCT
│  ├─ umap                                # RNA / ADT / MULTI
│  ├─ lsi                                 # ATAC
│  └─ spatial                             # SpatialPCA / BayesSpace
│
└─ tools                                  # products of spatial and integration analyses
   ├─ VisiumV1 / VisiumV2
   │  ├─ boundaries                       # spot / cell boundaries
   │  ├─ centroids
   │  └─ segmentation
   │
   └─ sketch / integration / anchors
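
To walk the top-level branches of this tree on a real object, the accessor functions are more convenient than poking at slots. A small sketch on a made-up, non-spatial toy object (so the reductions and images branches are empty here):

```r
library(Seurat)

set.seed(2)
# Made-up data: 30 genes x 20 cells
counts <- Matrix::Matrix(
  matrix(rpois(30 * 20, lambda = 3), nrow = 30,
         dimnames = list(paste0("gene", 1:30), paste0("cell", 1:20))),
  sparse = TRUE
)
obj <- CreateSeuratObject(counts = counts, project = "toy")

assay_names     <- Assays(obj)        # branches under "assays"
layer_names     <- Layers(obj)        # Layers of the default Assay
meta_cols       <- colnames(obj[[]])  # columns of meta.data
reduction_names <- Reductions(obj)    # empty until RunPCA() / RunUMAP() has run
image_names     <- Images(obj)        # empty for non-spatial data

print(assay_names)
print(layer_names)
print(meta_cols)
```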

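2.2 Structural differences between V4 and V5

Knowing how V4 and V5 differ matters when you handle objects created by different versions. In V4, the RNA Assay has fixed slots such as @counts, @data and @scale.data; in V5, the same data is organized as Layers under the Assay. V5 stays backward compatible: it can load and work with objects created by V4, although their assays may remain in the old Assay class until you convert them (see 9.1).

V4 structure (no Layers):

# V4 uses fixed slots (RNA here stands for obj@assays$RNA)
RNA@counts        # raw counts
RNA@data          # normalized data
RNA@scale.data    # scaled data

V5 structure (with Layers):

# V5 uses a flexible Layer structure
Layers(obj, assay = "RNA")  # list all Layers
# returns: counts.01, counts.02, data.01, data.02, ...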
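3. Splitting and joining Layers

3.1 Splitting Layers

# Split the RNA assay into one Layer per sample.
# Despite its name, scRNA_objs is still a single Seurat object.
scRNA_objs <- scRNA
scRNA_objs[["RNA"]] <- split(scRNA_objs[["RNA"]], f = scRNA_objs$orig.ident)

# Check the result
Layers(scRNA_objs, assay = "RNA")
# e.g. counts.Sample01, counts.Sample02, ...

# Number of cells per sample
table(scRNA_objs$orig.ident)

# For a list of separate Seurat objects, use SplitObject(scRNA, split.by = "orig.ident") instead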
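3.2 Joining Layers

For multi-sample integration, V5 provides IntegrateLayers(), which works directly on split Layers, so there is no need to call JoinLayers() first; see section 6, Batch-effect integration methods.

# Join all per-sample Layers back into single Layers
scRNA_merged <- JoinLayers(scRNA_objs)

# Check the result
table(scRNA_merged$orig.ident)
Layers(scRNA_merged, assay = "RNA")
# should show the joined Layers: counts, plus data and scale.data if they have been computed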
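
The split and join steps can be tried end to end on a toy object. A minimal sketch, assuming Seurat V5; the data and the `sample` column are invented, and NormalizeData() is included only so that there are data Layers to join as well.

```r
library(Seurat)

set.seed(3)
# Made-up data: 40 genes x 30 cells, three hypothetical samples
counts <- Matrix::Matrix(
  matrix(rpois(40 * 30, lambda = 2), nrow = 40,
         dimnames = list(paste0("gene", 1:40), paste0("cell", 1:30))),
  sparse = TRUE
)
obj <- CreateSeuratObject(counts = counts)
obj$sample <- rep(c("S1", "S2", "S3"), each = 10)

# Split, then normalize so that every sample also gets a data Layer
obj[["RNA"]] <- split(obj[["RNA"]], f = obj$sample)
obj <- NormalizeData(obj, verbose = FALSE)
layers_split <- Layers(obj, assay = "RNA")
print(layers_split)

# Join the per-sample Layers back together
joined <- JoinLayers(obj)
layers_joined <- Layers(joined, assay = "RNA")
print(layers_joined)

# The joined counts Layer covers all cells again
dims_joined <- dim(LayerData(joined, assay = "RNA", layer = "counts"))
total_counts <- sum(LayerData(joined, assay = "RNA", layer = "counts"))
```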

4. Data access functions

4.1 GetAssayData() in detail

GetAssayData() is the main function for pulling an expression matrix out of a Seurat object. In V5 its layer argument selects the target Layer, which can be a single sample's Layer or a joined one.

# Raw counts of a single sample (object with split Layers)
counts_sample01 <- GetAssayData(
  object = scRNA_objs,
  assay = "RNA",
  layer = "counts.Sample01"  # per-sample Layer
)

# Normalized data after joining (requires JoinLayers() first)
data_merged <- GetAssayData(
  object = scRNA_merged,
  assay = "RNA",
  layer = "data"
)

# Expression of selected genes in selected cells:
# fetch the Layer, then subset the matrix by gene and cell names
expr_data <- GetAssayData(object = scRNA, assay = "RNA", layer = "data")
subset_expr <- expr_data[
  c("CD3D", "CD4", "CD8A"),   # genes of interest
  colnames(scRNA)[1:100]       # cells of interest
]

# Values for the layer argument
# - before joining: "counts.Sample01", "data.Sample01"
# - after joining:  "counts", "data", "scale.data"
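
Here is the same access pattern on a toy object that can be run as is. It is only a sketch: the matrix is random, and the three marker gene names are simply reused as row names so the subsetting step has something to find.

```r
library(Seurat)

set.seed(4)
# Made-up data: 30 genes x 20 cells; the first three rows borrow real gene names
genes <- c("CD3D", "CD4", "CD8A", paste0("gene", 4:30))
counts <- Matrix::Matrix(
  matrix(rpois(30 * 20, lambda = 2), nrow = 30,
         dimnames = list(genes, paste0("cell", 1:20))),
  sparse = TRUE
)
obj <- CreateSeuratObject(counts = counts)
obj <- NormalizeData(obj, verbose = FALSE)   # creates the "data" Layer

# Fetch the whole Layer, then subset by gene and cell names
expr_data <- GetAssayData(obj, assay = "RNA", layer = "data")
subset_expr <- expr_data[c("CD3D", "CD4", "CD8A"), colnames(obj)[1:5]]
print(dim(subset_expr))
```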

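4.2 Layers()

Layers() lists the Layers in an object and is the first thing to run when inspecting an unfamiliar object: it shows which Layers exist and whether they are still split by sample.

# List Layers (of the default Assay unless assay is given)
all_layers <- Layers(obj)

# List the Layers of a specific Assay
rna_layers <- Layers(obj, assay = "RNA")

# Keep only one type of Layer (base R, no pipe needed)
count_layers <- grep("^counts", Layers(obj, assay = "RNA"), value = TRUE)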

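5. How preprocessing behaves on multi-Layer objects

5.1 Per-Layer versus joint processing

If you run normalization, variable-feature selection, scaling and PCA on a multi-Layer object without calling JoinLayers(), the steps do not all behave the same way. NormalizeData() and FindVariableFeatures() work Layer by Layer, so each sample is normalized on its own and gets its own variable-feature ranking, from which a consensus set is derived. As far as I can tell from the official integration vignette, ScaleData() and RunPCA() then work on that consensus set across all cells, producing one scale.data Layer and one shared (still uncorrected) pca reduction. That shared PCA is the input IntegrateLayers() expects.

# Preprocessing on a multi-Layer object
obj <- NormalizeData(obj)                    # per Layer: creates data.<sample>
obj <- FindVariableFeatures(obj)             # per Layer, then a consensus HVG set
obj <- ScaleData(obj)                        # one scale.data Layer across all cells
obj <- RunPCA(obj)                           # one shared, uncorrected PCA

# Result: per-sample data Layers, plus a single scale.data Layer and a single "pca" reduction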
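
A quick way to see what these four calls actually leave behind on a split object is to list the Layers and the PCA dimensions afterwards. A sketch on simulated counts (negative binomial noise with made-up gene means; no real biology in it), assuming Seurat V5:

```r
library(Seurat)

set.seed(5)
n_genes <- 300
n_cells <- 120
# Made-up gene-specific means so that variable-feature selection has a mean/variance trend to fit
gene_mu <- rgamma(n_genes, shape = 2, rate = 0.5)
counts <- Matrix::Matrix(
  matrix(rnbinom(n_genes * n_cells, mu = gene_mu, size = 2), nrow = n_genes,
         dimnames = list(paste0("gene", seq_len(n_genes)),
                         paste0("cell", seq_len(n_cells)))),
  sparse = TRUE
)
obj <- CreateSeuratObject(counts = counts)
obj$sample <- rep(c("S1", "S2"), each = n_cells / 2)
obj[["RNA"]] <- split(obj[["RNA"]], f = obj$sample)

# Standard preprocessing on the split object, without JoinLayers()
obj <- NormalizeData(obj, verbose = FALSE)
obj <- FindVariableFeatures(obj, nfeatures = 100, verbose = FALSE)
obj <- ScaleData(obj, verbose = FALSE)
obj <- RunPCA(obj, npcs = 10, verbose = FALSE)

layer_names <- Layers(obj, assay = "RNA")
pca_dims <- dim(Embeddings(obj, reduction = "pca"))
print(layer_names)   # per-sample counts/data Layers and the scale.data Layer
print(pca_dims)      # cells x PCs of the single "pca" reduction
```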

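5.2 Recommended preprocessing strategy

For multi-sample projects, my recommendation is to run QC and preprocessing on each sample separately, combine the samples, and then correct batch effects. This keeps sample-specific QC decisions while still allowing cross-sample comparison. The suggested workflow:

Stage 1: per-sample processing

- Run QC filtering on each sample separately
- Adjust the filtering thresholds to each sample's characteristics
- Run NormalizeData, FindVariableFeatures, ScaleData and RunPCA per sample

Stage 2: combining and integration

- Combine all samples with merge(), which keeps one Layer per sample (see section 8)
- Run batch correction with Harmony or another method through IntegrateLayers()
- Run FindNeighbors, FindClusters and RunUMAP once on the integrated reduction
- Call JoinLayers() only after integration, so that the Layers are still split when IntegrateLayers() runs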

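6. Batch-effect integration methods

Before integrating, the object (with Layers still split by sample) should already have gone through RunPCA, since the methods below start from the pca reduction.

6.1 Harmony integration

Harmony is an iterative batch-correction algorithm: it repeatedly clusters cells in the PCA embedding and shifts them so that each cluster mixes cells from different batches. It is computationally cheap, scales to very large datasets, and is one of the most widely used integration methods.

# Batch-effect integration with Harmony (requires the harmony package)
obj <- IntegrateLayers(
  object = obj,
  method = HarmonyIntegration,
  orig.reduction = "pca",
  new.reduction = "harmony",
  theta = 3,           # diversity penalty; larger values push batches to mix more
  verbose = FALSE
)

# Cluster and visualize on the integrated embedding
obj <- FindNeighbors(obj, dims = 1:30, reduction = "harmony")
obj <- FindClusters(obj, resolution = 2, cluster.name = "harmony_clusters")
# Store the UMAP under its own name so it does not overwrite the "harmony" reduction
obj <- RunUMAP(obj, dims = 1:30, reduction = "harmony", reduction.name = "umap.harmony")

# Check the integration ("stim" is the batch / condition column in meta.data)
DimPlot(obj, reduction = "umap.harmony", group.by = c("stim", "harmony_clusters"))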
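6.2 CCA integration (canonical correlation analysis)

CCA integrates by finding a shared space in which the batches are maximally correlated. It fits datasets whose batches differ biologically but share similar overall expression patterns.

# Integration with CCA
obj <- IntegrateLayers(
  object = obj,
  method = CCAIntegration,
  orig.reduction = "pca",
  new.reduction = "integrated.cca",
  # normalization.method = "SCT",  # only if the object was normalized with SCTransform()
  verbose = FALSE
)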

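6.3 RPCA integration (reciprocal PCA)

RPCA is a faster and more robust (more conservative) method: each dataset is projected into the other's PCA space, and anchors are the cells that are mutual nearest neighbors there. It is a good fit for highly heterogeneous datasets.

# Integration with RPCA
obj <- IntegrateLayers(
  object = obj,
  method = RPCAIntegration,
  orig.reduction = "pca",
  new.reduction = "integrated.rpca",
  # normalization.method = "SCT",  # only if the object was normalized with SCTransform()
  verbose = FALSE
)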

6.4 FastMNN integration

FastMNN (fast mutual nearest neighbors) corrects batches by finding pairs of cells that are mutual nearest neighbors across batches and using them to estimate the correction. It requires the SeuratWrappers and batchelor packages.

# Install the required packages
if (!requireNamespace("SeuratWrappers", quietly = TRUE)) {
  devtools::install_github("satijalab/seurat-wrappers",
                           ref = "seurat5", force = TRUE, upgrade = "never")
}
if (!requireNamespace("batchelor", quietly = TRUE)) {
  BiocManager::install("batchelor")
}
library(SeuratWrappers)

# Run FastMNN integration
obj <- IntegrateLayers(
  object = obj,
  method = FastMNNIntegration,
  new.reduction = "integrated.mnn",
  # normalization.method = "SCT",  # left out: as far as I know this is an option of the CCA / RPCA methods
  verbose = FALSE
)

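6.5 scVI integration (deep learning)

scVI (single-cell Variational Inference) is a deep-learning method built on a variational autoencoder that copes well with complex batch effects and sparse data. It needs the scvi-tools package installed in a Python environment.

# Install the Python dependencies
# See: https://docs.scvi-tools.org/en/stable/installation.html

# Integration with scVI (scVIIntegration comes from SeuratWrappers)
obj <- IntegrateLayers(
  object = obj,
  method = scVIIntegration,
  new.reduction = "integrated.scvi",
  conda_env = "../miniconda3/envs/scvi-env",  # path to the conda environment
  # normalization.method = "SCT",  # left out: as far as I know this is an option of the CCA / RPCA methods
  verbose = FALSE
)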

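6.6 Choosing an integration method

I am not an expert on this, so I will not go into depth. Rough rules of thumb:

- Harmony suits most routine analyses and is computationally efficient;
- CCA suits cases where batch differences are small;
- RPCA suits good-quality data that needs fast integration;
- FastMNN performs well on complex batch effects;
- scVI suits projects that demand high integration quality and have ample compute.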

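7. Post-integration analysis

Once batch effects are integrated, clustering is run on the integrated reduction. The clustering resolution should be tuned to the expected number of cell types and the complexity of the dataset; values between 0.4 and 2.0 are a common range to explore.

# Cluster on the integrated reduction
obj <- FindNeighbors(obj, reduction = "integrated.rpca", dims = 1:30)
obj <- FindClusters(obj, resolution = 0.6)  # tune the resolution to get a sensible number of clusters
obj <- RunUMAP(obj, dims = 1:30, reduction = "integrated.rpca")  # stored as "umap" by default

# Plot the UMAP colored by sample origin and by cell type
DimPlot(obj, reduction = "umap",
        group.by = c("stim", "seurat_annotations"))

# Check the batch correction
DimPlot(obj, reduction = "umap", group.by = "stim", split.by = "stim")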
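
Exploring the resolution range can be done as a small sweep that records how many clusters each value produces. This is a sketch on simulated data with two made-up cell groups; it clusters on the plain "pca" reduction because the toy object has no batches, whereas in a real project the neighbor graph would be built on the integrated reduction.

```r
library(Seurat)

set.seed(6)
n_genes <- 300
n_cells <- 120
# Made-up gene means; the second half of the cells over-expresses the first 30 genes
gene_mu <- rgamma(n_genes, shape = 2, rate = 0.5)
mu_mat <- matrix(gene_mu, nrow = n_genes, ncol = n_cells)
mu_mat[1:30, 61:120] <- mu_mat[1:30, 61:120] * 5
counts <- Matrix::Matrix(
  matrix(rnbinom(n_genes * n_cells, mu = as.vector(mu_mat), size = 2), nrow = n_genes,
         dimnames = list(paste0("gene", seq_len(n_genes)),
                         paste0("cell", seq_len(n_cells)))),
  sparse = TRUE
)
obj <- CreateSeuratObject(counts = counts)
obj <- NormalizeData(obj, verbose = FALSE)
obj <- FindVariableFeatures(obj, nfeatures = 100, verbose = FALSE)
obj <- ScaleData(obj, verbose = FALSE)
obj <- RunPCA(obj, npcs = 10, verbose = FALSE)
obj <- FindNeighbors(obj, reduction = "pca", dims = 1:10, verbose = FALSE)

# Sweep the resolution and count the clusters found at each value
resolutions <- c(0.4, 1.0, 2.0)
n_clusters <- sapply(resolutions, function(res) {
  tmp <- FindClusters(obj, resolution = res, verbose = FALSE)
  nlevels(Idents(tmp))
})
names(n_clusters) <- resolutions
print(n_clusters)
```

The neighbor graph is built once and reused, so only the cheap FindClusters() step is repeated per resolution.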

8. Complete pipeline

library(Seurat)

# Step 1: inspect the current Layer structure
Layers(obj)

# Step 2: split into a list of per-sample objects (one per counts Layer)
count_layers <- grep("^counts", Layers(obj, assay = "RNA"), value = TRUE)
sample_list <- lapply(count_layers, function(lyr) {
  mat <- GetAssayData(obj, layer = lyr, assay = "RNA")
  sobj <- CreateSeuratObject(counts = mat, project = sub("^counts\\.", "", lyr))

  # Per-sample QC (in practice, plot each sample and pick thresholds one by one rather than in a loop)
  sobj[["percent.mt"]] <- PercentageFeatureSet(sobj, pattern = "^MT-")
  sobj <- subset(sobj,
                 subset = nFeature_RNA > 200 & nFeature_RNA < 5000
                          & percent.mt < 20)

  # Per-sample normalization and feature selection
  sobj <- NormalizeData(sobj)
  sobj <- FindVariableFeatures(sobj, selection.method = "vst", nfeatures = 2000)
  sobj <- ScaleData(sobj, features = VariableFeatures(sobj)) # add vars.to.regress for cell-cycle correction if needed
  sobj <- RunPCA(sobj, features = VariableFeatures(sobj))
  # - UMAP / DoubletFinder to remove doublets (if needed)

  return(sobj)
})
names(sample_list) <- sub("^counts\\.", "", count_layers)

# Step 3: merge all samples (merge() keeps one Layer per sample)
scRNA_merge <- merge(x = sample_list[[1]],
                     y = sample_list[-1],
                     add.cell.ids = names(sample_list))

# Step 4: joint analysis on the merged object
scRNA_merge <- NormalizeData(scRNA_merge)
scRNA_merge <- FindVariableFeatures(scRNA_merge, selection.method = "vst", nfeatures = 2000)
scRNA_merge <- ScaleData(scRNA_merge, features = VariableFeatures(scRNA_merge))
scRNA_merge <- RunPCA(scRNA_merge, features = VariableFeatures(scRNA_merge))

# Step 5: batch-effect integration (Harmony as the example)
scRNA_merge <- IntegrateLayers(
  object = scRNA_merge,
  method = HarmonyIntegration,
  orig.reduction = "pca",
  new.reduction = "harmony",
  theta = 3
)

# Step 6: clustering and visualization
scRNA_merge <- FindNeighbors(scRNA_merge, dims = 1:30, reduction = "harmony")
scRNA_merge <- FindClusters(scRNA_merge, resolution = 2, cluster.name = "harmony_clusters")
# Store the UMAP under its own name so it does not overwrite the "harmony" reduction
scRNA_merge <- RunUMAP(scRNA_merge, dims = 1:30, reduction = "harmony",
                       reduction.name = "umap.harmony")

# Final plot
DimPlot(scRNA_merge, reduction = "umap.harmony",
        group.by = c("orig.ident", "harmony_clusters"))
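
Since the QC thresholds in step 2 should really be chosen per sample, it helps to look at a per-sample summary before hard-coding any cutoffs. A minimal sketch on a made-up object; the `MT-` gene names and the `sample` column are invented, and the median is just one reasonable summary.

```r
library(Seurat)

set.seed(7)
# Made-up data: 50 genes (3 of them named like mitochondrial genes) x 40 cells
genes <- c(paste0("MT-", c("ND1", "CO1", "CYB")), paste0("gene", 1:47))
counts <- Matrix::Matrix(
  matrix(rpois(50 * 40, lambda = 2), nrow = 50,
         dimnames = list(genes, paste0("cell", 1:40))),
  sparse = TRUE
)
obj <- CreateSeuratObject(counts = counts)
obj$sample <- rep(c("S1", "S2"), each = 20)
obj[["percent.mt"]] <- PercentageFeatureSet(obj, pattern = "^MT-")

# Per-sample medians of the usual QC metrics, taken from meta.data
qc <- obj[[]]
qc_summary <- aggregate(cbind(nFeature_RNA, percent.mt) ~ sample,
                        data = qc, FUN = median)
print(qc_summary)
```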

9. Common questions and caveats

9.1 Layer-related questions

Question: how do I tell whether an object has been joined?

# Inspect the Layer structure
Layers(obj)
# not joined: several counts.xxx, data.xxx, ...
# joined:     counts, data, scale.data
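
The same check can be wrapped in a small helper. `is_joined()` is a hypothetical function of my own, not part of Seurat; it only looks for per-sample `counts.` Layers, which is an assumption that matches the naming shown above.

```r
library(Seurat)

# Hypothetical helper: TRUE when the assay has no per-sample counts Layers left
is_joined <- function(object, assay = "RNA") {
  !any(grepl("^counts\\.", Layers(object, assay = assay)))
}

set.seed(8)
# Made-up data: 30 genes x 20 cells, two hypothetical samples
counts <- Matrix::Matrix(
  matrix(rpois(30 * 20, lambda = 2), nrow = 30,
         dimnames = list(paste0("gene", 1:30), paste0("cell", 1:20))),
  sparse = TRUE
)
obj <- CreateSeuratObject(counts = counts)
obj$sample <- rep(c("S1", "S2"), each = 10)

joined_before <- is_joined(obj)              # fresh object: single counts Layer
obj[["RNA"]] <- split(obj[["RNA"]], f = obj$sample)
joined_split <- is_joined(obj)               # split: counts.S1, counts.S2
joined_after <- is_joined(JoinLayers(obj))   # joined again

print(c(before = joined_before, split = joined_split, after = joined_after))
```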

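Question 1: how do I upgrade a V4 object to V5?

Seurat V5 can load V4-format objects and most functions work on them directly, so manual conversion is usually not required. The assays may still be in the old Assay class, though, so I recommend converting and saving the object in V5 format for compatibility.

# Manual option 1:
obj <- UpdateSeuratObject(obj) # update the object structure to the installed Seurat version
# Option 2: convert the assay itself to a V5 assay
obj[["RNA_v5"]] <- as(object = obj[["RNA"]], Class = "Assay5")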

Question 2: how do I convert V5 back to V4?

obj[["RNA_v3"]] <- as(object = obj[["RNA"]], Class = "Assay") # convert a v5 assay to a v4 assay

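9.2 Memory management tips

Memory management is critical when working with large single-cell datasets. Some suggestions:

- Keep only the Layers needed for the current analysis step;
- Save intermediate results regularly in case the session is interrupted;
- For very large datasets, consider SCTransform instead of the standard NormalizeData; whether this actually saves time or memory depends on the dataset, so check on a subset first.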

References

Seurat official documentation: https://satijalab.org/seurat
Seurat V5 release notes: https://satijalab.org/seurat/articles/seurat5