Solving the GO/KEGG Annotation Problem for Non-Model Species

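As we know, GO/KEGG enrichment analysis in R leans heavily on OrgDb annotation databases. Model organisms are well served, since each has a matching OrgDb package. Non-model organisms are not so lucky. Fortunately, we have AnnotationHub.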

1. What is OrgDb?

OrgDb (Organism Database) is the species annotation database provided by Bioconductor. It holds gene, transcript, and protein annotation, for example:

Gene names (gene symbol)
Gene IDs (Entrez ID, Ensembl ID)
GO annotation (Gene Ontology)
KEGG pathway information
Chromosomal location

Commonly used OrgDb packages:

org.Hs.eg.db (human)
org.Mm.eg.db (mouse)
org.Dm.eg.db (fruit fly)

2. Why download OrgDb through AnnotationHub?

Usually an OrgDb package can be installed directly, for example:

BiocManager::install("org.Hs.eg.db")

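But AnnotationHub is more flexible, because it:

✅ Serves the annotation records in the current hub snapshot for your Bioconductor version, with no manual package updates

✅ Covers many more species (especially less commonly studied ones)

✅ Manages multiple databases through one interface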

3. Installing AnnotationHub

If it is not installed yet, run this first:

# Install BiocManager only if it is missing, then use it to install AnnotationHub
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("AnnotationHub")

Load the package:

library(AnnotationHub)

4. Downloading an OrgDb database

Method 1: query by species name

# Create the AnnotationHub connection
ah <- AnnotationHub()

# Query the OrgDb records for human (Homo sapiens)
human_orgdb <- query(ah, c("OrgDb", "Homo sapiens"))
human_orgdb

Example output (illustrative; the record ID and snapshot date on your machine will differ):

AnnotationHub with 1 record
# snapshotDate(): 2024-01-01
# names(): AH12345
# package(): org.Hs.eg.db
# title: Homo sapiens (Human) OrgDb
# dataprovider: Bioconductor
# species: Homo sapiens

Download the data:

# Download by record ID (replace the placeholder with your actual ID)
hs_db <- ah[["AH12345"]]

Method 2: filter by database type

# First list all OrgDb records
orgdbs <- query(ah, "OrgDb")

# See which species are covered (one entry per species)
unique(orgdbs$species)

Download the OrgDb for mouse (Mus musculus):

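# Keep only the mouse records; which() drops records whose species is NA
mm_records <- orgdbs[which(orgdbs$species == "Mus musculus")]
# Download the first matching record by its hub ID
mm_db <- ah[[names(mm_records)[1]]]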
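The same lookup applies to species that have no dedicated org.*.eg.db package, which is the case this article is really about. Below is a minimal sketch for maize (Zea mays), the species whose download log appears near the end of this article. The helper name `find_orgdb` is hypothetical, and which records come back depends on your hub snapshot.

```r
# Hypothetical helper: list the OrgDb records AnnotationHub holds for a species.
# `hub` is an AnnotationHub object; `species` is a scientific name.
find_orgdb <- function(hub, species) {
  records <- AnnotationHub::query(hub, c("OrgDb", species))
  # Return record IDs and titles as a plain data frame for inspection
  data.frame(id = names(records),
             title = records$title,
             stringsAsFactors = FALSE)
}

# Usage (needs network access the first time):
# ah    <- AnnotationHub::AnnotationHub()
# hits  <- find_orgdb(ah, "Zea mays")
# hits                                  # check which record you actually want
# zm_db <- ah[[hits$id[nrow(hits)]]]    # assumption: take the last record listed
```

Inspecting the table before downloading matters because a species can have more than one record, and the OrgDb files can be large.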

5. Using the OrgDb data

(1) Getting gene information

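library(AnnotationDbi)  # provides columns(), keytypes(), keys(), and select()

# Inspect what the database offers
columns(mm_db)   # available fields
keytypes(mm_db)  # available ID types

# Extract the Entrez ID and symbol of the first 10 genes
# (AnnotationDbi::select avoids a clash with dplyr::select)
genes <- AnnotationDbi::select(mm_db,
                keys = head(keys(mm_db, keytype = "ENTREZID"), 10),
                columns = c("ENTREZID", "SYMBOL"),
                keytype = "ENTREZID")
head(genes, 10)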
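Enrichment tools usually expect Entrez IDs, so a common next step is converting gene symbols. The sketch below uses `AnnotationDbi::mapIds` for that; the helper name `symbols_to_entrez` and the example symbols are illustrative assumptions, not part of the original workflow.

```r
# Hypothetical helper: convert gene symbols to Entrez IDs using an OrgDb object.
symbols_to_entrez <- function(orgdb, symbols) {
  ids <- AnnotationDbi::mapIds(orgdb,
                               keys = symbols,
                               column = "ENTREZID",
                               keytype = "SYMBOL",
                               multiVals = "first")  # keep the first hit per symbol
  # Symbols without a match come back as NA; drop them
  ids[!is.na(ids)]
}

# Usage, with mm_db as loaded above (example mouse symbols):
# entrez <- symbols_to_entrez(mm_db, c("Trp53", "Brca1"))
# writeLines(unname(entrez), "entrez_ids.txt")  # one ID per line
```

Note that `mapIds` stops with an error when none of the supplied keys are valid for the chosen keytype, so check `keytypes()` first if the call fails.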

(2) Querying GO annotation

# Get the GO annotation of a single gene
go_annot <- AnnotationDbi::select(mm_db,
                   keys = "103980",  # example Entrez ID
                   columns = c("GO", "ONTOLOGY"),
                   keytype = "ENTREZID")
go_annot
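Since enrichment analysis is the end goal, here is a rough sketch of passing an OrgDb object obtained from AnnotationHub to `clusterProfiler::enrichGO`. The wrapper name `run_go_enrichment` and the cutoff values are assumptions for illustration (the cutoffs mirror enrichGO's documented defaults); adjust them to your own analysis.

```r
# Hypothetical wrapper: GO enrichment with an OrgDb object from AnnotationHub.
# `entrez_ids` is a character vector of Entrez gene IDs of interest.
run_go_enrichment <- function(entrez_ids, orgdb, ontology = "BP") {
  clusterProfiler::enrichGO(
    gene          = entrez_ids,
    OrgDb         = orgdb,        # the object returned by ah[["..."]]
    keyType       = "ENTREZID",
    ont           = ontology,     # "BP", "MF", "CC", or "ALL"
    pAdjustMethod = "BH",
    pvalueCutoff  = 0.05,
    qvalueCutoff  = 0.2
  )
}

# Usage, assuming my_entrez_ids holds your genes of interest:
# ego <- run_go_enrichment(my_entrez_ids, mm_db)
# head(as.data.frame(ego))
```

Passing the object itself, rather than a package name, is what makes this work for species that have no installable org.*.eg.db package.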

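6. Summary

✅ AnnotationHub provides a single interface to OrgDb databases

✅ query(ah, c("OrgDb", "species name")) finds records quickly

✅ ah[["ID"]] downloads the data

✅ select() extracts gene, GO, pathway, and other information

Good fits:

When you need up-to-date annotation data
Research on non-model organisms that lack a dedicated OrgDb package (e.g. rice, maize)
Managing databases for several species in bulk

Try using AnnotationHub to fetch the data for your own study species! 🧬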

One last reminder: because AnnotationHub has to download data over the network, transfers are often interrupted, producing errors such as:

Loading required package: BiocFileCache
Loading required package: dbplyr
downloading 1 resources
retrieving 1 resource
Error loading resource.
 attempting to re-download
downloading 1 resources
retrieving 1 resource
Error: failed to load resource
  name: AH117408
  title: org.Zea_mays.eg.sqlite
  reason: 1 resources failed to download
In addition: Warning messages:
1: download failed
  web resource path: 'https://annotationhub.bioconductor.org/fetch/124154'
  local file path: '/data/jobs/000/323/323996/home/.cache/R/AnnotationHub/196a96482_124154'
  reason: Transferred a partial file [mghp.osn.xsede.org]: end of response with 182687483 bytes missing 
2: bfcadd() failed; resource removed
  rid: BFC3
  fpath: 'https://annotationhub.bioconductor.org/fetch/124154'
  reason: download failed 
3: download failed
  hub path: 'https://annotationhub.bioconductor.org/fetch/124154'
  cache resource: 'AH117408 : 124154'
  reason: bfcadd() failed; see warnings() 
4: download failed
  web resource path: 'https://annotationhub.bioconductor.org/fetch/124154'
  local file path: '/data/jobs/000/323/323996/home/.cache/R/AnnotationHub/192777bc9d_124154'
  reason: Transferred a partial file [mghp.osn.xsede.org]: end of response with 162371316 bytes missing 
5: bfcadd() failed; resource removed
  rid: BFC4
  fpath: 'https://annotationhub.bioconductor.org/fetch/124154'
  reason: download failed 
6: download failed
  hub path: 'https://annotationhub.bioconductor.org/fetch/124154'
  cache resource: 'AH117408 : 124154'
  reason: bfcadd() failed; see warnings() 
Execution halted

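When this happens, the best options are to cache the OrgDb resources you need ahead of time, or to retry from a network that can reach https://annotationhub.bioconductor.org reliably.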
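A simple way to cope with interrupted transfers is to retry the download a few times and then rely on the local cache afterwards. The sketch below is one possible approach, not an AnnotationHub feature: `with_retry` is a hypothetical helper written in base R, and the record ID is the one from the log above, which may not exist in your snapshot.

```r
# Hypothetical helper: call f() up to `attempts` times, pausing between failures.
with_retry <- function(f, attempts = 3, wait_seconds = 10) {
  for (i in seq_len(attempts)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)  # success: hand back the value
    message(sprintf("Attempt %d/%d failed: %s",
                    i, attempts, conditionMessage(result)))
    if (i < attempts) Sys.sleep(wait_seconds)       # wait before the next try
  }
  stop("All ", attempts, " attempts failed")
}

# Usage:
# ah    <- AnnotationHub::AnnotationHub()
# zm_db <- with_retry(function() ah[["AH117408"]])
#
# After one successful download the file stays in the local cache, so a later
# session can skip the network entirely:
# ah_local <- AnnotationHub::AnnotationHub(localHub = TRUE)
```

Running the download once on a machine with a stable connection, then working with `localHub = TRUE`, keeps batch jobs from failing halfway through on a flaky link.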

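Of course, the simplest route is an online tool on the Galaxy bioinformatics cloud platform. Suppose we want to run a GO enrichment analysis for maize (Zea mays).

Open the platform: https://usegalaxy.cn
Upload the list of maize gene Entrez IDs (one per line). For the conversion, see this account's article: How do you convert gene IDs? This DAVID bioinformatics tool, now taken in by the NIH, can ease your anxiety
Search for the tool: clusterProfiler enrichGO
Set the parameters as shown in the figure below

Finally, click Run.

If the figure size or other details are not to your liking once the run finishes, adjust the parameters and rerun until you get a satisfactory result.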

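About the community

The China Galaxy bioinformatics cloud platform (UseGalaxy.cn) takes "making bioinformatics analysis simpler" as its mission. The platform is committed to providing full-stack bioinformatics analysis solutions for researchers, medical institutions, and biotech industry professionals.

How to get involved

• Be active in the community groups and join the discussions.

• Create written or video tutorials for the platform.

• Build tools or workflows for the platform.

• Sponsor the platform.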
