Enrichment Analysis with Scanpy in Practice: From Gene List to Pathway Interpretation with gseapy

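Your differential expression analysis has finished and you have a list of DEGs. Now what?

A list of gene names says little on its own; the meaning lies in the pathways. Enrichment analysis translates your gene list into a "story" that a biologist can understand.

In the Python ecosystem, gseapy is one of the most convenient tools for enrichment analysis. This post walks through it end to end.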

1. Installation and Basic Usage

pip install gseapy scanpy pandas matplotlib seaborn

import gseapy as gp
import pandas as pd

# Method 1: pass a gene list directly (Over-Representation Analysis, ORA)
gene_list = ['CD79A', 'MS4A1', 'CD19', 'CD22', 'PAX5',
             'EBF1', 'BANK1', 'FCRL5', 'TCL1A', 'BLNK']

enr = gp.enrichr(
    gene_list=gene_list,
    organism='Human',
    gene_sets=['GO_Biological_Process_2023', 'KEGG_2021_Human'],
    outdir='enrichr_results',
    cutoff=0.5
)

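# Inspect the results. enr.results holds the rows for every requested library;
# in recent gseapy versions enr.res2d may hold only the last library queried.
print(enr.results.head(10))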
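
To make concrete what an ORA p-value measures: for each term it is, in essence, an overlap test between your gene list and the term's gene set. The sketch below uses a one-sided hypergeometric test on made-up numbers. The function name `ora_pvalue` and all the counts are illustrative assumptions, not gseapy's internal code, and Enrichr reports additional scores on top of the p-value.

```python
from math import comb

def ora_pvalue(n_background, n_in_term, n_query, n_overlap):
    """One-sided hypergeometric tail: P(overlap >= n_overlap)."""
    max_overlap = min(n_in_term, n_query)
    total = comb(n_background, n_query)
    tail = sum(
        comb(n_in_term, k) * comb(n_background - n_in_term, n_query - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total

# Hypothetical numbers: 20,000 background genes, a 100-gene term,
# a 10-gene query list, 5 of which fall inside the term.
p = ora_pvalue(n_background=20_000, n_in_term=100, n_query=10, n_overlap=5)
print(f"p = {p:.3g}")
```

The choice of background matters: a smaller background (for example, only the genes detected in your dataset) makes the same overlap less surprising.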

2. GSEA (Requires a Ranked List)

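GSEA avoids one weakness of ORA: it needs no hard threshold to split genes into "significant" and "not significant". Instead it uses the ranking of all genes and asks whether the members of a gene set cluster toward the top or the bottom of that ranking.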
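
The core of GSEA is a running-sum statistic over the ranked list. Here is a minimal sketch of that idea on a toy ranking; `enrichment_score` and the placeholder gene names are illustrative assumptions, and the real tool adds permutation testing and normalization that are left out here.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """Weighted running-sum enrichment score.

    ranked: list of (gene, score) pairs sorted by score, descending.
    Returns (es, running), where es is the running-sum value with the
    largest absolute deviation from zero.
    """
    hits = [gene in gene_set for gene, _ in ranked]
    n_miss = len(ranked) - sum(hits)
    hit_total = sum(abs(s) ** weight for (_, s), h in zip(ranked, hits) if h)
    if hit_total == 0 or n_miss == 0:
        raise ValueError("need weighted hits and at least one non-member gene")

    running, total = [], 0.0
    for (_, score), is_hit in zip(ranked, hits):
        if is_hit:
            total += abs(score) ** weight / hit_total  # step up on a member
        else:
            total -= 1.0 / n_miss                      # step down otherwise
        running.append(total)
    return max(running, key=abs), running

# Toy ranking; GENE_X/Y/Z are placeholders, not real symbols.
ranked = [("CD79A", 3.0), ("MS4A1", 2.5), ("GENE_X", 1.0),
          ("CD19", 0.5), ("GENE_Y", -0.5), ("GENE_Z", -2.0)]
es, running = enrichment_score(ranked, {"CD79A", "MS4A1", "CD19"})
print(round(es, 3))
```

A positive score means the set's genes are concentrated near the top of the ranking; a negative score means they sit near the bottom.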

# Extract a ranked gene list from Scanpy
import scanpy as sc
import numpy as np

adata = sc.datasets.pbmc68k_reduced()
sc.tl.rank_genes_groups(adata, groupby='louvain', method='wilcoxon')

# Pull the DEG table for one cluster and sort it by log fold change
df = sc.get.rank_genes_groups_df(adata, group='1')  # cluster '1', taken here to be B cells (confirm with marker genes)
df = df.dropna(subset=['logfoldchanges', 'pvals_adj'])
df = df.sort_values('logfoldchanges', ascending=False)

# Build the ranked gene list that GSEA needs (index = gene, value = ranking metric)
rnk = pd.Series(
    df['logfoldchanges'].values,
    index=df['names'].values
)

# GSEA
gs = gp.prerank(
    rnk=rnk,
    gene_sets=['GO_Biological_Process_2023', 'KEGG_2021_Human'],
    outdir='gsea_results',
    permutation_num=1000,
    seed=42,
    min_size=10,
    max_size=500
)

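# Classic GSEA enrichment plot (running enrichment score)
from gseapy import gseaplot
terms = gs.res2d.Term[:3]
for i, term in enumerate(terms, start=1):
    # gs.results[term] holds the per-term details (hits, RES, nes, pval, fdr).
    # Files are numbered because term names can be long and unsafe in paths.
    gseaplot(rank_metric=gs.ranking, term=term,
             ofname=f'gsea_top{i}.pdf', **gs.results[term])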
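
The ranking built above only drops NaN rows. Log fold changes from single-cell data may also contain non-finite values, and a table can carry duplicated gene symbols, both of which are worth handling before calling `gp.prerank`. The following is a sketch under those assumptions; `clean_ranking` is a hypothetical helper, and keeping the entry with the largest absolute score is just one possible policy.

```python
import math

def clean_ranking(pairs):
    """Return (gene, score) pairs sorted by score, descending.

    Drops NaN/inf scores; for duplicated gene symbols, keeps the
    entry with the largest absolute score.
    """
    best = {}
    for gene, score in pairs:
        if not math.isfinite(score):
            continue
        if gene not in best or abs(score) > abs(best[gene]):
            best[gene] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

# Toy input; GENE_X/GENE_Y are placeholders.
raw = [("CD79A", 4.2), ("MS4A1", float("inf")), ("CD19", 1.1),
       ("CD19", -0.3), ("GENE_X", float("nan")), ("GENE_Y", -2.0)]
ranked = clean_ranking(raw)
print(ranked)
```

With the DataFrame from this section, the input could be `zip(df['names'], df['logfoldchanges'])`, and the cleaned pairs can be turned back into a `pd.Series` indexed by gene.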

3. Batch Enrichment Across Multiple Clusters

# Run enrichment on the top DEGs of every cluster
all_results = []

for cluster in adata.obs['louvain'].unique():
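    df = sc.get.rank_genes_groups_df(adata, group=cluster)
    # Keep significant, upregulated genes; rows are already sorted by score
    sig = df[(df['pvals_adj'] < 0.05) & (df['logfoldchanges'] > 0)]
    top_genes = sig.head(200)['names'].tolist()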

    if len(top_genes) < 5:
        continue

    enr = gp.enrichr(
        gene_list=top_genes,
        organism='Human',
        gene_sets=['GO_Biological_Process_2023'],
        outdir=f'enrichr_cluster_{cluster}',
        cutoff=0.5,
        no_plot=True
    )

    result = enr.res2d.copy()
    result['cluster'] = cluster
    all_results.append(result)

# Merge the results from all clusters
combined = pd.concat(all_results, ignore_index=True)
combined.to_csv('enrichment_all_clusters.csv', index=False)
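
The `Adjusted P-value` column in these tables is a multiple-testing correction applied across the terms of one query, so in the loop above each cluster is corrected separately. Assuming the correction is Benjamini-Hochberg (an assumption worth checking against the Enrichr documentation), the procedure can be sketched as follows; `benjamini_hochberg` and the toy p-values are illustrative.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw_p = [0.01, 0.04, 0.03, 0.005]  # toy p-values for four terms
adj = benjamini_hochberg(raw_p)
print([round(a, 3) for a in adj])
```

Because the correction is per query, comparing adjusted p-values across clusters with very different numbers of input genes should be done with some care.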

4. Visualization

import matplotlib.pyplot as plt

# Dot plot
# `enr` here is the result for the last cluster processed in the loop above.
from gseapy import dotplot
ax = dotplot(enr.res2d, title='GO Enrichment', cutoff=0.05,
             top_term=15, figsize=(6, 8))
# Save through the returned Axes; gseapy may not register its figure with pyplot.
ax.figure.savefig('enrichment_dotplot.pdf', dpi=300, bbox_inches='tight')

# Bar plot
from gseapy import barplot
ax = barplot(enr.res2d, title='GO Biological Process', top_term=10,
             figsize=(6, 6), color='steelblue')
ax.figure.savefig('enrichment_barplot.pdf', dpi=300, bbox_inches='tight')

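# Multi-cluster pathway heatmap
# Arrange each cluster's terms into a term x cluster matrix
pivot = combined.pivot_table(
    index='Term', columns='cluster',
    values='Adjusted P-value', aggfunc='min'
)
pivot = -np.log10(pivot + 1e-10)  # -log10(adjusted p-value); 1e-10 avoids log(0)
# Put the most significant terms first so head(20) below picks the top ones
pivot = pivot.loc[pivot.max(axis=1).sort_values(ascending=False).index]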

import seaborn as sns
fig, ax = plt.subplots(figsize=(12, 8))
sns.heatmap(pivot.head(20), cmap='Reds', annot=True, fmt='.1f',
            linewidths=0.5, ax=ax)
ax.set_title('Enrichment Heatmap by Cluster')
plt.savefig('enrichment_heatmap.pdf', dpi=300, bbox_inches='tight')

5. Commonly Used Gene Set Libraries

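Database	Typical use	gseapy name
GO BP	General biological processes	`GO_Biological_Process_2023`
GO CC	Cellular components	`GO_Cellular_Component_2023`
GO MF	Molecular functions	`GO_Molecular_Function_2023`
KEGG	Metabolic, signaling, and disease pathways	`KEGG_2021_Human`
Reactome	Curated reactions and signaling pathways	`Reactome_2022`
MSigDB	Hallmark gene sets (well-defined biological states and processes)	`MSigDB_Hallmark_2020`
CellMarker	Cell type markers	`CellMarker_Augmented_2021`

Library names change between Enrichr releases; `gp.get_library_name(organism='Human')` lists the ones currently available.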
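
The libraries above are fetched from Enrichr by name, but gseapy also accepts a plain dict of `{term: [genes]}` for `gene_sets`, which is handy for custom or offline gene sets. Here is a minimal sketch that builds such a dict from GMT-formatted lines (term, description, then genes, all tab-separated); `parse_gmt` and the two toy sets are hypothetical.

```python
def parse_gmt(lines):
    """Parse GMT-formatted lines into a {term: [genes]} dict."""
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        term, _description, *genes = fields
        gene_sets[term] = [g for g in genes if g]
    return gene_sets

# Toy GMT content; in practice, iterate over an open .gmt file instead.
example = [
    "B_CELL_TOY\thypothetical set\tCD79A\tMS4A1\tCD19\n",
    "T_CELL_TOY\thypothetical set\tCD3D\tCD3E\n",
]
custom_sets = parse_gmt(example)
print(custom_sets)
```

A dict like `custom_sets` can then be passed as `gene_sets=custom_sets` to `gp.prerank` in place of the library names.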

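Enrichment analysis is the bridge from "gene list" to "biological story". Interpreting the pathways still takes domain knowledge, though: not every significant pathway is relevant to your research question, so the results need manual filtering and judgment.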