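Once the target dataset has been retrieved, data mining can begin. This article uses the Alzheimer's disease dataset GSE1297 as an example.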
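In the previous section we downloaded the GEO dataset and extracted the gene expression matrix. The row names of that matrix are microarray probe IDs, which need to be converted to gene names.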

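Download the platform file
1. Set the AnnotGPL parameter to TRUE to download the platform's SOFT file over the network. (Connections from mainland China are very slow and often drop.)
Note: the annotation can be extracted directly only after the SOFT file has downloaded completely. If the download failed, the annotation is entirely empty; the later code still runs, but the resulting matrix is not correct.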
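Since the later code runs without complaint even when the annotation is empty, a quick sanity check right after the download may save some confusion. The following is only a sketch on toy data: the helper name `annotation_is_usable` and the two small tables are made up for illustration, and it assumes that a failed download shows up as missing or blank values in every column except the probe ID.

```r
# Hypothetical helper: TRUE when the annotation table has rows and at least
# one non-missing, non-blank value outside the first (probe ID) column.
annotation_is_usable <- function(annot) {
  if (nrow(annot) == 0 || ncol(annot) < 2) return(FALSE)
  values <- unlist(lapply(annot[, -1, drop = FALSE], as.character))
  any(!is.na(values) & nzchar(trimws(values)))
}

# Toy tables standing in for a complete and a failed download
good <- data.frame(ID = c("p1", "p2"), Symbol = c("GENE1", "GENE2"))
bad  <- data.frame(ID = c("p1", "p2"), Symbol = c(NA, NA))

annotation_is_usable(good)
annotation_is_usable(bad)
```

If the check fails on real data, deleting the partially downloaded file from the destdir folder and downloading again is probably the first thing to try.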
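Extract the annotation information
library(GEOquery)
# Specify the GEO dataset ID
gse_id <- "GSE1297"
# Fetch the dataset with getGEO; AnnotGPL = TRUE also downloads the platform annotation
gse_info <- getGEO(gse_id, destdir = ".", AnnotGPL = TRUE)
# Extract the annotation as a plain data.frame (fData() returns the contents of featureData());
# if the SOFT file did not download completely, the annotation is entirely empty
annotation <- fData(gse_info[[1]])
# Inspect the column names of the platform annotation
colnames(annotation)
# Keep two columns: column 1 (probe ID) and column 11 (gene name).
# Check the positions against the colnames() output first, since the column order depends on which platform file was downloaded
platform_file_set <- annotation[, c(1, 11)]

# You can also try downloading the GPL96 platform file on its own
gse_gp <- getGEO("GPL96", destdir = ".") # on a poor connection this fails with: Failed to download ./GPL96.soft.gz!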
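Because the download drops so often, wrapping it in a small retry loop can be convenient. The sketch below is not part of GEOquery: `with_retry` is a hypothetical helper, and the `flaky` function is a toy stand-in for a download that fails twice before succeeding. It assumes a failed download raises an R error.

```r
# Hypothetical helper: call f() up to `attempts` times, pausing `wait`
# seconds between failures, and return the first successful result.
with_retry <- function(f, attempts = 3, wait = 0) {
  last_error <- NULL
  for (i in seq_len(attempts)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    last_error <- result
    message("Attempt ", i, " failed: ", conditionMessage(result))
    if (i < attempts) Sys.sleep(wait)
  }
  stop("All ", attempts, " attempts failed: ", conditionMessage(last_error))
}

# Toy stand-in for a flaky download: fails twice, then succeeds
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("Failed to download")
  "downloaded"
}
outcome <- with_retry(flaky, attempts = 5)
```

With GEOquery loaded, the same wrapper could be used as `with_retry(function() getGEO("GPL96", destdir = "."), wait = 5)`.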
2. Download the platform file manually from the GEO website.

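dir() # list the files in the project directory
# Read the platform file (tab-delimited text)
platform_file <- read.delim("GPL96-57554.txt", header = TRUE, sep = "\t", comment.char = "#")
# Inspect the column names of the platform file
colnames(platform_file)
# Keep two columns: column 1 (probe ID) and column 11 (gene name)
platform_file_set <- platform_file[, c(1, 11)]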
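Picking columns by position works only as long as the file layout stays the same. One alternative is to look the gene symbol column up by name. This is a minimal sketch on a toy table: the column names below are assumptions that mimic a GPL table after `read.delim` (which turns spaces in column names into dots), so check them against your own `colnames()` output.

```r
# Toy platform table; the column names are assumptions, not the real GPL96 header
platform_toy <- data.frame(
  ID = c("probe_1", "probe_2"),
  GB_ACC = c("U00001", "U00002"),
  Gene.Symbol = c("GENE1", "GENE2")
)

# Find the gene symbol column by name instead of by position
symbol_col <- grep("symbol", colnames(platform_toy), ignore.case = TRUE, value = TRUE)
stopifnot(length(symbol_col) == 1)  # stop if the match is missing or ambiguous

platform_toy_set <- platform_toy[, c("ID", symbol_col)]
```

The `stopifnot` guard matters when a table carries more than one column with "symbol" in its name; in that case a stricter pattern is needed.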

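Convert probe IDs to gene names
First, convert the format of the expression matrix extracted in the previous section.
The expression matrix is a matrix object, and the merge function used below cannot join it directly because it has no ID column to join on. Convert it to a data.frame first and add that column; otherwise merge fails with: Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column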
# Convert the expression matrix to a data.frame
exprSet <- data.frame(expression_data)
# Add the probe IDs (currently the row names) as a new ID column
exprSet$ID <- rownames(exprSet)
# Both tables now share an ID column, so merge them on it
express <- merge(x = exprSet, y = platform_file_set, by = "ID")
# Drop the probe ID column
express$ID <- NULL
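One thing to keep in mind is that `merge` performs an inner join by default, so probes that are missing from the platform table are dropped without any warning. The toy example below (all probe, sample, and gene names are invented) shows the difference between the default and `all.x = TRUE`.

```r
# Toy expression matrix: three probes (rows) by two samples (columns)
expr_toy <- matrix(c(5.1, 6.2, 7.3, 5.4, 6.5, 7.6), nrow = 3,
                   dimnames = list(c("probe_1", "probe_2", "probe_3"),
                                   c("sample_1", "sample_2")))
# Toy platform table that annotates only two of the three probes
platform_toy <- data.frame(ID = c("probe_1", "probe_2"),
                           Gene.Symbol = c("GENE1", "GENE2"))

expr_df <- data.frame(expr_toy)
expr_df$ID <- rownames(expr_df)

inner <- merge(expr_df, platform_toy, by = "ID")                # drops probe_3
left  <- merge(expr_df, platform_toy, by = "ID", all.x = TRUE)  # keeps probe_3 with NA
```

Comparing `nrow(inner)` with the number of rows in the original matrix is a simple way to see how many probes had no annotation.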



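With the probe ID column deleted, 32 columns remain: this is the expression matrix with gene names.
Look at the last column: some probes match more than one gene. The next section covers how to handle this.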
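Before deciding how to handle them, it can help to count how many probes are affected. The sketch below assumes that multiple gene symbols in one cell are separated by " /// ", which is the convention in many Affymetrix platform tables; confirm the separator in your own file. The symbols themselves are toy values.

```r
# Toy gene symbol column: one multi-gene probe and one unannotated probe
gene_symbols <- c("GENE1", "GENE2 /// GENE3", "GENE4", "")

# Flag probes whose annotation lists more than one gene
has_multiple <- grepl("///", gene_symbols, fixed = TRUE)

# Number of genes listed per probe (0 for an empty annotation)
n_genes <- lengths(strsplit(gene_symbols, " /// ", fixed = TRUE))

table(has_multiple)
```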
Extension: using Bioconductor annotation packages
| gpl | bioc_package |
|-------|---------|
| GPL96 | hgu133a |
# Find the annotation package for GPL96: hgu133a.db
gpl bioc_package title
GPL32 mgu74a MG_U74A Affymetrix Murine Genome U74A Array
GPL33 mgu74b MG_U74B Affymetrix Murine Genome U74B Array
GPL34 mgu74c MG_U74C Affymetrix Murine Genome U74C Array
GPL71 ag AG Affymetrix Arabidopsis Genome Array
GPL72 drosgenome1 DrosGenome1 Affymetrix Drosophila Genome Array
GPL74 hcg110 HC_G110 Affymetrix Human Cancer Array
GPL75 mu11ksuba Mu11KsubA Affymetrix Murine 11K SubA Array
GPL76 mu11ksubb Mu11KsubB Affymetrix Murine 11K SubB Array
GPL77 mu19ksuba Mu19KsubA Affymetrix Murine 19K SubA Array
GPL78 mu19ksubb Mu19KsubB Affymetrix Murine 19K SubB Array
GPL79 mu19ksubc Mu19KsubC Affymetrix Murine 19K SubC Array
GPL80 hu6800 Hu6800 Affymetrix Human Full Length HuGeneFL Array
GPL81 mgu74av2 MG_U74Av2 Affymetrix Murine Genome U74A Version 2 Array
GPL82 mgu74bv2 MG_U74Bv2 Affymetrix Murine Genome U74B Version 2 Array
GPL83 mgu74cv2 MG_U74Cv2 Affymetrix Murine Genome U74 Version 2 Array
GPL85 rgu34a RG_U34A Affymetrix Rat Genome U34 Array
GPL86 rgu34b RG_U34B Affymetrix Rat Genome U34 Array
GPL87 rgu34c RG_U34C Affymetrix Rat Genome U34 Array
GPL88 rnu34 RN_U34 Affymetrix Rat Neurobiology U34 Array
GPL89 rtu34 RT_U34 Affymetrix Rat Toxicology U34 Array
GPL90 ygs98 YG_S98 Affymetrix Yeast Genome S98 Array
GPL91 hgu95av2 HG_U95A Affymetrix Human Genome U95A Array
GPL92 hgu95b HG_U95B Affymetrix Human Genome U95B Array
GPL93 hgu95c HG_U95C Affymetrix Human Genome U95C Array
GPL94 hgu95d HG_U95D Affymetrix Human Genome U95D Array
GPL95 hgu95e HG_U95E Affymetrix Human Genome U95E Array
GPL96 hgu133a HG-U133A Affymetrix Human Genome U133A Array
GPL97 hgu133b HG-U133B Affymetrix Human Genome U133B Array
GPL98 hu35ksuba Hu35KsubA Affymetrix Human 35K SubA Array
GPL99 hu35ksubb Hu35KsubB Affymetrix Human 35K SubB Array
GPL100 hu35ksubc Hu35KsubC Affymetrix Human 35K SubC Array
GPL101 hu35ksubd Hu35KsubD Affymetrix Human 35K SubD Array
GPL198 ath1121501 ATH1-121501 Affymetrix Arabidopsis ATH1 Genome Array
GPL199 ecoli2 Ecoli_ASv2 Affymetrix E. coli Antisense Genome Array
GPL200 celegans Celegans Affymetrix C. elegans Genome Array
GPL201 hgfocus HG-Focus Affymetrix Human HG-Focus Target Array
GPL339 moe430a MOE430A Affymetrix Mouse Expression 430A Array
GPL340 mouse4302 MOE430B Affymetrix Mouse Expression 430B Array
GPL341 rae230a RAE230A Affymetrix Rat Expression 230A Array
GPL342 rae230b RAE230B Affymetrix Rat Expression 230B Array
GPL570 hgu133plus2 HG-U133_Plus_2 Affymetrix Human Genome U133 Plus 2.0 Array
GPL571 hgu133a2 HG-U133A_2 Affymetrix Human Genome U133A 2.0 Array
GPL886 hgug4111a Agilent-011871 Human 1B Microarray G4111A (Feature Number version)
GPL887 hgug4110b Agilent-012097 Human 1A Microarray (V2) G4110B (Feature Number version)
GPL1261 mouse430a2 Mouse430_2 Affymetrix Mouse Genome 430 2.0 Array
GPL1318 xenopuslaevis Xenopus_laevis Affymetrix Xenopus laevis Genome Array
GPL1319 zebrafish Zebrafish Affymetrix Zebrafish Genome Array
GPL1322 drosophila2 Drosophila_2 Affymetrix Drosophila Genome 2.0 Array
GPL1352 u133x3p U133_X3P Affymetrix Human X3P Array
GPL1355 rat2302 Rat230_2 Affymetrix Rat Genome 230 2.0 Array
GPL1708 hgug4112a Agilent-012391 Whole Human Genome Oligo Microarray G4112A (Feature Number version)
GPL2112 bovine Bovine Affymetrix Bovine Genome Array
GPL2529 yeast2 Yeast_2 Affymetrix Yeast Genome 2.0 Array
GPL2891 h20kcod GE Healthcare/Amersham Biosciences CodeLink™ UniSet Human 20K I Bioarray
GPL2898 adme16cod GE Healthcare/Amersham Biosciences CodeLink™ ADME Rat 16-Assay Bioarray
GPL3154 ecoli2 E_coli_2 Affymetrix E. coli Genome 2.0 Array
GPL3213 chicken Chicken Affymetrix Chicken Genome Array
GPL3533 porcine Porcine Affymetrix Porcine Genome Array
GPL3738 canine2 Canine_2 Affymetrix Canine Genome 2.0 Array
GPL3921 hthgu133a HT_HG-U133A Affymetrix HT Human Genome U133A Array
GPL3979 canine Canine Affymetrix Canine Genome 1.0 Array
GPL4032 Maize Affymetrix Maize Genome Array
GPL4191 h10kcod CodeLink UniSet Human I Bioarray
GPL5188 huex10sttranscriptcluster HuEx-1_0-st Affymetrix Human Exon 1.0 ST Array probe set (exon) version
GPL5689 hgug4100a Agilent Human 1 cDNA Microarray (G4100A) layout C
GPL6097 illuminaHumanv1 Illumina human-6 v1.0 expression beadchip
GPL6102 illuminaHumanv2 Illumina human-6 v2.0 expression beadchip
GPL6244 hugene10sttranscriptcluster HuGene-1_0-st Affymetrix Human Gene 1.0 ST Array transcript (gene) version
GPL6246 mogene10sttranscriptcluster MoGene-1_0-st Affymetrix Mouse Gene 1.0 ST Array transcript (gene) version
GPL6885 illuminaMousev2 Illumina MouseRef-8 v2.0 expression beadchip
GPL6947 illuminaHumanv3 Illumina HumanHT-12 V3.0 expression beadchip
GPL8300 hgu95av2 HG_U95Av2 Affymetrix Human Genome U95 Version 2 Array
GPL8321 mouse430a2 Mouse430A_2 Affymetrix Mouse Genome 430A 2.0 Array
GPL8490 IlluminaHumanMethylation27k Illumina HumanMethylation27 BeadChip (HumanMethylation27_270596_v.1.2)
GPL10558 illuminaHumanv4 Illumina HumanHT-12 V4.0 expression beadchip
GPL11532 hugene11sttranscriptcluster HuGene-1_1-st Affymetrix Human Gene 1.1 ST Array transcript (gene) version
GPL13497 HsAgilentDesign026652 Agilent-026652 Whole Human Genome Microarray 4x44K v2 (Probe Name version)
GPL13534 IlluminaHumanMethylation450k Illumina HumanMethylation450 BeadChip (HumanMethylation450_15017482)
GPL13667 hgu219 HG-U219 Affymetrix Human Genome U219 Array
GPL14877 hgu133plus2 Affymetrix Human Genome U133 Plus 2.0 Array Brainarray Version 13, HGU133Plus2_Hs_ENTREZG
GPL15380 GGHumanMethCancerPanelv1 Illumina Sentrix Array Matrix (SAM) - GoldenGate Methylation Cancer Panel I
GPL15396 hthgu133b HT_HG-U133B Affymetrix HT Human Genome U133B Array custom CDF: ENTREZ brainarray v. 14
GPL17556 hugene10sttranscriptcluster HuGene-1_0-st Affymetrix Human Gene 1.0 ST Array HuGene10stv1_Hs_ENTREZG_17.0.0
GPL17897 hthgu133a HT_HG-U133A Affymetrix Human Genome U133A Array (custom CDF: HTHGU133A_Hs_ENTREZG.cdf version 17.0.0)
GPL18190 hugene11sttranscriptcluster HuGene-1_1-st Affymetrix Human Gene 1.1 ST Array CDF: Brainarray HuGene11stv1_Hs_ENTREZG_15.1.0
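The table above can be turned into a small lookup, and the matching annotation package can then map probe IDs to gene symbols. The following is a sketch under a few assumptions: `annotation_package` is a hypothetical helper, the lookup holds only three rows copied from the table, the package name is taken to be the table entry plus ".db", and the mapping step runs only if the package and AnnotationDbi are already installed.

```r
# A few rows copied from the table above
gpl_to_bioc <- data.frame(
  gpl = c("GPL96", "GPL570", "GPL6244"),
  bioc_package = c("hgu133a", "hgu133plus2", "hugene10sttranscriptcluster")
)

# Hypothetical helper: return the annotation package name for a GPL accession
annotation_package <- function(gpl_id, lookup = gpl_to_bioc) {
  hit <- lookup$bioc_package[lookup$gpl == gpl_id]
  if (length(hit) == 0) return(NA_character_)
  paste0(hit[1], ".db")
}

pkg <- annotation_package("GPL96")

# Map a probe ID to its gene symbol only when the packages are installed
if (requireNamespace(pkg, quietly = TRUE) &&
    requireNamespace("AnnotationDbi", quietly = TRUE)) {
  db <- getExportedValue(pkg, pkg)  # the annotation object shares the package name
  probe2symbol <- AnnotationDbi::select(db, keys = "1007_s_at",
                                        columns = "SYMBOL", keytype = "PROBEID")
}
```

The resulting `probe2symbol` table can be merged with the expression data on the probe ID in the same way as the platform file.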
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL96