The Best Single-Cell Tutorial (Part 18): Mapping Cell Types to the Cell Ontology for More Professional Single-Cell Annotation


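Author's note

In single-cell data analysis, standardized cell type annotation is essential for data integration and comparative studies. This article shows how to use the Cell Ontology (CL) to standardize your cell type annotations and improve the reproducibility and comparability of your research. This tutorial was first published in 单细胞最好的中文教程 (The Best Single-Cell Tutorial in Chinese); do not repost it without permission.

Word count | estimated reading time: ~4000 | 5 min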

------Starlitnightly(星夜)

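🔍 What is the Cell Ontology?

The Cell Ontology (CL) is a standardized system for classifying and describing cell types across organisms. As an important resource for model organism and bioinformatics databases, it has the following features:

  • 📚 Contains detailed classifications of more than 2,700 animal cell types
  • Provides high-level cell type classes
  • Serves as a mapping reference for cell types in other ontologies, such as the Plant Ontology or the Drosophila Anatomy Ontology
  • Integrates with other ontologies such as Uberon, GO, CHEBI, PR, and PATO
  • Links cell types to related concepts such as anatomical structures and biological processes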

💡 Tip: Annotating with standardized Cell Ontology terms makes your results easier for other groups to find, compare, and reuse.

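The Cell Ontology (CL) was created in 2004 and has been a core OBO Foundry ontology since the Foundry was established. Since then, CL has been adopted by many projects, including HuBMAP, the Human Cell Atlas (HCA), the cellxgene platform, the Single Cell Expression Atlas, the BRAIN Initiative Cell Census Network (BICCN), ArrayExpress, The Cell Image Library, ENCODE, and FANTOM5, to annotate cell types and to support the construction of cell reference atlases.

Here we provide several functions that convert your annotated cell names to the corresponding Cell Ontology names and IDs. All of the analysis is done through the omicverse.single.CellOntologyMapper class. Let's get hands-on!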

python
import scanpy as sc
#import pertpy as pt
import omicverse as ov
ov.plot_set()

%load_ext autoreload
%autoreload 2

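📊 Data preparation

Before converting cell names, you need to finish annotating your cells. In this tutorial we use the haber_2017_regions dataset from pertpy as an example. It is a single-cell sequencing dataset from the small intestine that contains several epithelial cell types.

python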
import pertpy as pt
adata = pt.dt.haber_2017_regions()
adata.obs['cell_label'].unique()
Output:
['Enterocyte.Progenitor', 'Stem', 'TA.Early', 'TA', 'Tuft', 'Enterocyte', 'Goblet', 'Endocrine']
Categories (8, object): ['Endocrine', 'Enterocyte', 'Enterocyte.Progenitor', 'Goblet', 'Stem', 'TA', 'TA.Early', 'Tuft']

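⬇️ Download the Cell Ontology file

Before starting the analysis, we need to download the cl.json file from the Cell Ontology. This file contains the complete Cell Ontology database. There are several ways to get it:

Option 1: download from the command line

shell
# Download cl.json from the OBO page
!mkdir -p new_ontology
!wget http://purl.obolibrary.org/obo/cl/cl.json -O new_ontology/cl.json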

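Option 2: automatic download (recommended for beginners)

The omicverse.single.download_cl() function automates the download. Even when there are network problems, it tries several download sources in turn (the log below shows the first of three).

python
ov.single.download_cl(output_dir="new_ontology", filename="cl.json")
Output:
Downloading Cell Ontology to: new_ontology/cl.json
============================================================

[1/3] Trying Official OBO Library...
    URL: http://purl.obolibrary.org/obo/cl/cl.json
    Description: Direct download from official Cell Ontology
    → Downloading...

Option 3: manual download (fallback when network access is restricted)

If your network access is restricted, download cl.json manually on a machine that has access (for example from the URL shown in the log above) and copy it to new_ontology/cl.json.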
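
Once cl.json is on disk, it can help to look at what it holds before building a mapper. The sketch below assumes the file follows the OBO Graphs JSON layout (a top-level `graphs` list whose `nodes` carry `id` and `lbl` fields). `load_labels`, `prefix_of`, and the small in-memory sample are illustrative and not part of omicverse.

```python
import json
from collections import Counter

def load_labels(obograph: dict) -> dict:
    """Return {term IRI: label} for every labeled node in an OBO Graphs document."""
    labels = {}
    for graph in obograph.get("graphs", []):
        for node in graph.get("nodes", []):
            if "lbl" in node:  # nodes without a label are skipped
                labels[node["id"]] = node["lbl"]
    return labels

def prefix_of(iri: str) -> str:
    """Ontology prefix of an OBO PURL, e.g. 'CL' for .../obo/CL_0000815."""
    return iri.rsplit("/", 1)[-1].split("_", 1)[0]

# Small stand-in for the real file (assumed layout).
# In practice: json.load(open("new_ontology/cl.json"))
sample = json.loads("""
{"graphs": [{"nodes": [
  {"id": "http://purl.obolibrary.org/obo/CL_0000815", "lbl": "regulatory T cell", "type": "CLASS"},
  {"id": "http://purl.obolibrary.org/obo/CL_0002250", "lbl": "intestinal crypt stem cell", "type": "CLASS"},
  {"id": "http://purl.obolibrary.org/obo/GO_1903703", "lbl": "enterocyte differentiation", "type": "CLASS"},
  {"id": "http://example.org/unlabeled-node"}
]}]}
""")

labels = load_labels(sample)
counts = Counter(prefix_of(iri) for iri in labels)
print(counts)
```

Run on the full file, a count like this would show how many labels come from CL itself and how many are imported from other ontologies such as GO, which matters later when a cell label is matched to a process term instead of a cell type.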

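🛠️ Configure CellOntologyMapper

At its core, CellOntologyMapper relies on a SentenceTransformer NLP embedding model. Choosing a suitable model matters for mapping quality:

| Model | Strength | Recommended use |
| BAAI/bge-base-en-v1.5 | Best performance | Formal analyses that need high accuracy |
| BAAI/bge-small-en-v1.5 | Fast | Quick tests or small datasets |
| sentence-transformers/all-MiniLM-L6-v2 | Balanced | Everyday analysis |

You can also find more models on Hugging Face, here via the hf-mirror.com mirror: https://hf-mirror.com/models?library=sentence-transformers

💡 Tip: If you have enough compute, use BAAI/bge-base-en-v1.5 for the best results.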

python
# Build the mapper: parse the ontology file and embed its labels with the chosen model
mapper = ov.single.CellOntologyMapper(
    cl_obo_file="new_ontology/cl.json",
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    local_model_dir="./my_models"
)
Output:
🔨 Creating ontology resources from OBO file...
📖 Parsing ontology file...
🧠 Creating NLP embeddings...
🔄 Loading model sentence-transformers/all-MiniLM-L6-v2...
🌐 Checking network connectivity...
✓ Network connection available
🇨🇳 Using HF-Mirror (hf-mirror.com) for faster downloads in China
📁 Models will be saved to: ./my_models
🪞 Downloading model from HF-Mirror: sentence-transformers/all-MiniLM-L6-v2
✓ Model loaded successfully from HF-Mirror!
🔄 Encoding 16841 ontology labels...
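
The mapper's internals are not shown in this tutorial, but embedding-based matching generally works by ranking ontology labels by their cosine similarity to the query. The sketch below assumes that approach and swaps the SentenceTransformer model for a toy character-trigram vector so that it runs without downloading anything. `embed`, `cosine`, and `best_match` are hypothetical names, and the scores it produces are not comparable to the ones in the logs.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: character-trigram counts (a stand-in for a real sentence model)."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def best_match(query: str, labels: list) -> tuple:
    """Return (label, similarity) for the ontology label closest to the query."""
    q = embed(query)
    return max(((lab, cosine(q, embed(lab))) for lab in labels), key=lambda pair: pair[1])

ontology_labels = ["tuft cell", "enterocyte", "goblet cell", "helper T cell"]
label, score = best_match("Tuft", ontology_labels)
print(label, round(score, 3))
```

A real sentence model captures meaning where this toy only captures spelling, which is why the choice of model in the table above affects mapping quality.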

You can also load precomputed Cell Ontology embeddings directly, which is especially helpful if you only have a CPU.

python
mapper = ov.single.CellOntologyMapper(
    cl_obo_file="new_ontology/cl.json",
    embeddings_path='new_ontology/ontology_embeddings.pkl',
    local_model_dir="./my_models"
)
Output:
📥 Loading existing ontology embeddings...
📥 Loaded embeddings for 16841 ontology labels
📋 Ontology mappings loaded: 16841 cell types
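
Reusing saved embeddings follows a common compute-once, load-later pattern. The sketch below illustrates that pattern with pickle. The on-disk layout omicverse uses for ontology_embeddings.pkl is not described here, so the dictionary format, `load_or_compute`, and `fake_encode` are assumptions made for illustration.

```python
import pickle
import tempfile
from pathlib import Path

def load_or_compute(path: Path, labels: list, encode) -> dict:
    """Load cached {label: vector} from `path`, or compute and save it on first use."""
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh)
    embeddings = {label: encode(label) for label in labels}
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(embeddings, fh)
    return embeddings

calls = []

def fake_encode(label: str) -> list:
    """Placeholder for a slow model call; records how often it is invoked."""
    calls.append(label)
    return [float(len(label))]

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "new_ontology" / "ontology_embeddings.pkl"
    first = load_or_compute(cache, ["tuft cell", "enterocyte"], fake_encode)   # computes
    second = load_or_compute(cache, ["tuft cell", "enterocyte"], fake_encode)  # loads

print(len(calls))
```

Only unpickle files that you created yourself or that come from a source you trust, because loading a pickle can execute arbitrary code.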

Mapping cell type names

We can map our cell types directly with map_adata, and the results can then be visualized.

python
mapping_results = mapper.map_adata(
    adata, 
    cell_name_col='cell_label'
)

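🤖 LLM-assisted cell type mapping

In practice, researchers often name cell types with abbreviations (for example, TA for transit amplifying cell, or EC for endothelial cell). Such abbreviations can hurt matching against the Cell Ontology. To address this, we use an LLM (large language model) to expand the abbreviations before mapping.

Configuration parameters:

| Parameter | Description | Example |
| api_type | API type | openai, anthropic, ollama |
| tissue_context | Tissue of origin | "gut", "brain", "liver" |
| species | Species studied | "mouse", "human", "rat" |
| study_context | Study background | "single-cell sequencing of intestinal epithelial cells" |
| api_key | API key | "sk-..." |

⚠️ Security note: Keep your API key safe and never expose it publicly.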
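
One common way to keep a key out of notebooks and shared scripts is to read it from an environment variable. The sketch below assumes the variable is called OPENAI_API_KEY, which is a convention and not something omicverse requires; `get_api_key` and `mask` are hypothetical helpers.

```python
import os

def get_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Read the API key from an environment variable instead of hard-coding it."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable before running.")
    return key

def mask(key: str) -> str:
    """Shortened form that is safe to print in logs or notebooks."""
    return key[:3] + "..." + key[-2:] if len(key) > 8 else "***"

# Demo value only, so the example runs; export your real key in the shell instead.
os.environ.setdefault("OPENAI_API_KEY", "sk-demo-not-a-real-key")
print(mask(get_api_key()))
```

The returned string can then be passed as the api_key argument of mapper.setup_llm_expansion in place of a literal key.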

python
mapper.setup_llm_expansion(
    api_type="openai", model='gpt-4o-2024-11-20',
    tissue_context="gut",    # tissue context
    species="mouse",         # species
    study_context="Epithelial cells from the small intestine and organoids of mice. Some of the cells were also subject to Salmonella or Heligmosomoides polygyrus infection",
    api_key="sk-*"
)
Output:
✓ Loaded 10 cached abbreviation expansions
✓ LLM expansion functionality setup complete (Type: openai, Model: gpt-4o-2024-11-20)
🧬 Tissue context: gut
🔬 Study context: Epithelial cells from the small intestine and organoids of mice. Some of the cells were also subject to Salmonella or Heligmosomoides polygyrus infection
🐭 Species: mouse

You can use any OpenAI-compatible API, for example ohmygpt.

python
mapper.setup_llm_expansion(
    api_type="custom_openai",
    api_key="sk-*",
    model="gpt-4.1-2025-04-14",
    base_url="https://api.ohmygpt.com/v1"
)

LLM-assisted mapping

python
mapping_results = mapper.map_adata_with_expansion(
    adata=adata,
    cell_name_col='cell_label',
    threshold=0.5,
    expand_abbreviations=True  # enable abbreviation expansion
)
mapper.print_mapping_summary(mapping_results, top_n=15)
Output:
  🔤 Identified potential abbreviation: Stem
  🔤 Identified potential abbreviation: TA.Early
  🔤 Identified potential abbreviation: TA
  🔤 Identified potential abbreviation: Tuft
  🔤 Identified potential abbreviation: Goblet

🤖 Expanding 5 abbreviations using LLM...
  📝 [1/5] Expanding: Stem
    ✓ → Intestinal stem cell (Confidence: high)
    💡 Alternatives: Stem cell, Crypt stem cell
  📝 [2/5] Expanding: TA.Early
    ✓ → Transit Amplifying Early cell (Confidence: high)
    💡 Alternatives: Transit Amplifying progenitor cell (early stage), Transient Amplifying Early cell
  📝 [3/5] Expanding: TA
    ✓ → Transit amplifying cell (Confidence: high)
    💡 Alternatives: Tumor-associated cell, T cell activation-related cell
  📝 [4/5] Expanding: Tuft
    ✓ → Tuft cell (Confidence: high)
    💡 Alternatives: Brush cell
  📝 [5/5] Expanding: Goblet
    ✓ → Goblet cell (Confidence: high)


✓ Tuft -> tuft cell (Similarity: 0.787)
✓ Enterocyte -> enterocyte (Similarity: 0.776)
✓ TA -> transit amplifying cell of appendix (Similarity: 0.741)
✓ Stem -> intestinal crypt stem cell (Similarity: 0.735)
✓ Goblet -> small intestine goblet cell (Similarity: 0.734)
✓ Enterocyte.Progenitor -> enterocyte differentiation (Similarity: 0.688)
✓ TA.Early -> transit amplifying cell (Similarity: 0.688)
✓ Endocrine -> endocrine hormone secretion (Similarity: 0.643)

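We can see that, once their names were expanded, TA and TA.Early were mapped to transit amplifying cell types. Enterocyte.Progenitor, by contrast, was matched to the GO process term "enterocyte differentiation" (note the GO_ ID and the empty cell_ontology_cl_id in the table below), so it still needs manual review.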

python
adata.obs[['cell_label', 'cell_ontology', 'cell_ontology_similarity',
           'cell_ontology_ontology_id', 'cell_ontology_cl_id']].head()

| index | cell_label | cell_ontology | cell_ontology_similarity | cell_ontology_ontology_id | cell_ontology_cl_id |
| B1_AAACATACCACAAC_Control_Enterocyte.Progenitor | Enterocyte.Progenitor | enterocyte differentiation | 0.688446 | http://purl.obolibrary.org/obo/GO_1903703 | None |
| B1_AAACGCACGAGGAC_Control_Stem | Stem | intestinal crypt stem cell | 0.735365 | http://purl.obolibrary.org/obo/CL_0002250 | CL:0002250 |
| B1_AAACGCACTAGCCA_Control_Stem | Stem | intestinal crypt stem cell | 0.735365 | http://purl.obolibrary.org/obo/CL_0002250 | CL:0002250 |
| B1_AAACGCACTGTCCC_Control_Stem | Stem | intestinal crypt stem cell | 0.735365 | http://purl.obolibrary.org/obo/CL_0002250 | CL:0002250 |
| B1_AAACTTGACCACCT_Control_Enterocyte.Progenitor | Enterocyte.Progenitor | enterocyte differentiation | 0.688446 | http://purl.obolibrary.org/obo/GO_1903703 | None |
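
The table pairs each ontology IRI with a short CL ID, and the rows without a CL ID point at a GO term. A minimal sketch of that relationship, assuming the standard OBO PURL pattern seen in the table; `iri_to_curie` and `is_cell_type` are hypothetical helpers, not omicverse functions.

```python
import re

# OBO PURLs look like http://purl.obolibrary.org/obo/<PREFIX>_<digits>
OBO_PURL = re.compile(r"^http://purl\.obolibrary\.org/obo/([A-Za-z]+)_(\d+)$")

def iri_to_curie(iri: str):
    """Convert an OBO PURL such as .../obo/CL_0002250 to a CURIE such as CL:0002250."""
    match = OBO_PURL.match(iri)
    return f"{match.group(1)}:{match.group(2)}" if match else None

def is_cell_type(iri: str) -> bool:
    """True only when the term comes from the Cell Ontology (CL) namespace."""
    curie = iri_to_curie(iri)
    return curie is not None and curie.startswith("CL:")

# IRIs copied from the table above
hits = {
    "Stem": "http://purl.obolibrary.org/obo/CL_0002250",
    "Enterocyte.Progenitor": "http://purl.obolibrary.org/obo/GO_1903703",
}
for name, iri in hits.items():
    status = "cell type" if is_cell_type(iri) else "not a CL term, review manually"
    print(f"{name}: {iri_to_curie(iri)} ({status})")
```

A check like this makes it easy to list every label whose best match is not a cell type before the annotations are used downstream.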

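🔍 Validating the mapping results

To make sure the mapping is accurate, there are several ways to check it:

  1. Query the matches manually
  2. Check the similarity scores
  3. Verify the ontology IDs

Practical tips:

  • Similarity > 0.7: usually reliable
  • Similarity 0.5-0.7: needs manual review
  • Similarity < 0.5: use LLM assistance or map manually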
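
These cutoffs can be applied to a whole mapping at once. The sketch below sorts the similarity scores printed in the mapping summary earlier into three buckets; `triage` and the bucket names are hypothetical, and the cutoffs are the rules of thumb listed above, not fixed library settings.

```python
def triage(similarity: float) -> str:
    """Bucket a similarity score using the rule-of-thumb cutoffs listed above."""
    if similarity > 0.7:
        return "trusted"
    if similarity >= 0.5:
        return "review"
    return "remap"

# Similarity scores copied from the mapping summary earlier in this tutorial
scores = {"Tuft": 0.787, "Enterocyte": 0.776, "TA": 0.741, "Stem": 0.735, "Goblet": 0.734,
          "Enterocyte.Progenitor": 0.688, "TA.Early": 0.688, "Endocrine": 0.643}

buckets = {}
for name, score in scores.items():
    buckets.setdefault(triage(score), []).append(name)
print(buckets)
```

Enterocyte.Progenitor, which was matched to a GO term, falls into the review bucket. A high score alone is still no guarantee: TA scored 0.741 against an appendix-specific label, so a quick manual look at the trusted bucket is worthwhile too.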
python
res=mapper.find_similar_cells("T helper cell", top_k=10)
res=mapper.find_similar_cells("Macrophage", top_k=8)
Output:
🎯 Ontology cell types most similar to 'T helper cell':
 1. helper T cell                            (Similarity: 0.780)
 2. T-helper 1 cell activation               (Similarity: 0.738)
 3. T-helper 2 cell activation               (Similarity: 0.709)
 4. T-helper 9 cell                          (Similarity: 0.707)
 5. T-helper 2 cell                          (Similarity: 0.690)
 6. T-helper 1 cell                          (Similarity: 0.687)
 7. T cell domain                            (Similarity: 0.678)
 8. regulation of T-helper 1 cell activation (Similarity: 0.675)
 9. CD4-positive helper T cell               (Similarity: 0.664)
10. T-helper 1 cell cytokine production      (Similarity: 0.660)

🎯 Ontology cell types most similar to 'Macrophage':
 1. cycling macrophage                       (Similarity: 0.786)
 2. tissue-resident macrophage               (Similarity: 0.735)
 3. macrophage differentiation               (Similarity: 0.729)
 4. macrophage                               (Similarity: 0.719)
 5. epithelioid macrophage                   (Similarity: 0.718)
 6. macrophage migration                     (Similarity: 0.692)
 7. kidney interstitial alternatively activated macrophage (Similarity: 0.686)
 8. central nervous system macrophage        (Similarity: 0.685)

Get information about a cell type in the ontology

python
mapper.get_cell_info("regulatory T cell")
Output:
ℹ️  === regulatory T cell ===
🆔 Ontology ID: http://purl.obolibrary.org/obo/CL_0000815
📝 Description: regulatory T cell: A T cell which regulates overall immune responses as well as the responses of other T cell subsets through direct cell-cell contact and cytokine release. This cell type may express FoxP3 and CD25 and secretes IL-10 and TGF-beta.

Looking up a name that is not in the ontology (here, a plural form)

python
mapper.get_cell_info("regulatory T cells")
Output:
✗ Cell type not found: regulatory T cells
🔍 Found 0 cell types containing 'regulatory t cells':
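
The lookup above fails only because the query is plural. One way to guard against this is to normalize a name before querying. The sketch below assumes lookups need the ontology's own label; `candidate_names` and `resolve_name` are hypothetical helpers, and the plural handling is deliberately naive.

```python
def candidate_names(name: str) -> list:
    """Variants of a query to try in order: as given, then with a plural 's' removed."""
    base = " ".join(name.split())  # collapse stray whitespace
    variants = [base]
    if base.lower().endswith("s") and not base.lower().endswith("ss"):
        variants.append(base[:-1])  # "regulatory T cells" -> "regulatory T cell"
    return variants

def resolve_name(name: str, ontology_labels: set):
    """Return the first ontology label matching a variant (case-insensitive), else None."""
    by_lower = {label.lower(): label for label in ontology_labels}
    for variant in candidate_names(name):
        if variant.lower() in by_lower:
            return by_lower[variant.lower()]
    return None

labels = {"regulatory T cell", "helper T cell", "macrophage"}
print(resolve_name("regulatory T cells", labels))
```

The resolved label can then be passed to mapper.get_cell_info. Irregular plurals are not handled here; for those, mapper.find_similar_cells is the more forgiving route.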

Browse cell types in the ontology by category

python
my_categories = ["immune cell", "epithelial"]
mapper.browse_ontology_by_category(categories=my_categories, max_per_category=5)
Output:
📂 === Browse Ontology Cell Types by Category ===

🔍 Found 0 cell types containing 'immune cell':
--------------------------------------------------
🔍 Found 395 cell types containing 'epithelial':
  1. NS forest marker set of airway submucosal gland collecting duct epithelial cell (Human Lung).
  2. epithelial fate stem cell
  3. epithelial cell
  4. ciliated epithelial cell
  5. duct epithelial cell
... 390 more results

🏷️  【epithelial related】 (Showing top 5):
  1. NS forest marker set of airway submucosal gland collecting duct epithelial cell (Human Lung).
  2. epithelial fate stem cell
  3. epithelial cell
  4. ciliated epithelial cell
  5. duct epithelial cell
--------------------------------------------------

List the cell types in the ontology

python
# Show the first 10 cell types
res = mapper.list_ontology_cells(max_display=10)
Output:
📊 Total 16841 cell types in ontology

📋 First 10 cell types:
  1. TAC1
  2. STAB1
  3. TLL1
  4. MSR1
  5. TNC
  6. ROS1
  7. TNIP3
  8. HOMER3
  9. FCGR2B
 10. BPIFB2
... 16831 more cell types
💡 Use return_all=True to get complete list

Get an overview of the ontology

python
# Summary statistics for the whole ontology
stats = mapper.get_ontology_statistics()
Output:
📊 === Ontology Statistics ===
📝 Total cell types: 16841
📏 Average name length: 31.7 characters
📏 Shortest name length: 3 characters
📏 Longest name length: 144 characters

🔤 Most common words:
  of: 5473 times
  cell: 3857 times
  regulation: 3168 times
  negative: 1009 times
  positive: 1003 times
  process: 980 times
  development: 875 times
  differentiation: 727 times
  muscle: 639 times
  in: 571 times
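
Statistics of this kind are simple to reproduce on any list of labels. A minimal sketch with a hypothetical `label_statistics` helper and a four-label sample taken from outputs shown earlier; it is not the library's implementation.

```python
from collections import Counter
from statistics import mean

def label_statistics(labels: list, top_n: int = 3) -> dict:
    """Summarize label lengths and the most frequent words across ontology labels."""
    lengths = [len(label) for label in labels]
    words = Counter(word for label in labels for word in label.lower().split())
    return {
        "total": len(labels),
        "mean_length": round(mean(lengths), 1),
        "shortest": min(lengths),
        "longest": max(lengths),
        "top_words": words.most_common(top_n),
    }

sample_labels = ["helper T cell", "regulatory T cell",
                 "regulation of T-helper 1 cell activation", "TNC"]
summary = label_statistics(sample_labels)
print(summary)
```

In the real statistics above, frequent words such as "regulation" and "process" suggest that many of the 16841 loaded labels describe processes, not cell types, which is one more reason to check the ontology ID of each match.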

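📈 Summary and outlook

Mapping cell types to the Cell Ontology lets us:

  1. 🎯 Standardize cell type annotations
  2. 📊 Improve the comparability and reproducibility of data
  3. 🔗 Make integrated analysis across datasets easier
  4. 🧬 Uncover biological relationships between cell types

Best practices:

  1. When publishing, provide both the original annotations and the mapped Cell Ontology IDs
  2. Update your copy of the Cell Ontology regularly to stay in sync with current research
  3. Set up a standardized cell type annotation workflow to work more efficiently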
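
For the first recommendation, a small lookup table shipped alongside the data is often enough. The sketch below writes original labels next to their ontology label and CL ID as CSV using only the standard library; `mapping_to_csv` and the column names are assumptions, and the two rows are taken from the table shown earlier.

```python
import csv
import io

def mapping_to_csv(rows: list) -> str:
    """Write original labels next to their ontology label and CL ID as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["cell_label", "cell_ontology", "cl_id"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

rows = [
    {"cell_label": "Stem", "cell_ontology": "intestinal crypt stem cell", "cl_id": "CL:0002250"},
    # No CL ID was assigned to this label, so the field stays empty for manual curation
    {"cell_label": "Enterocyte.Progenitor", "cell_ontology": "enterocyte differentiation", "cl_id": ""},
]
csv_text = mapping_to_csv(rows)
print(csv_text)
```

Leaving the ID empty where no CL term was found keeps the gaps visible to readers instead of hiding them behind a best-guess match.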

🎉 Congratulations on finishing this tutorial! You now know how to map cell types to the Cell Ontology.