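Taking the GSE160756 dataset as an example: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE160756
After downloading the file, uploading it to the server, and decompressing it into a loom file, I first tried opening it with Python.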
import loompy
import numpy as np

with loompy.connect("GSM4878538_umi_hNP_1.loom") as ds:
    print(ds.shape)  # matrix dimensions
    print(ds.row_attrs["Gene"][:10])  # first 10 gene names
    print(ds.col_attrs["CellID"][:5])  # first 5 cell IDs
This raises an error:
with loompy.connect("GSM4878538_umi_hNP_1.loom") as ds:
File "/usr/local/lib/python3.9/site-packages/loompy/loompy.py", line 1634, in connect
return LoomConnection(filename, mode, validate=validate)
File "/usr/local/lib/python3.9/site-packages/loompy/loompy.py", line 86, in __init__
raise ValueError("\n".join(lv.errors) + f"\n{filename} does not appear to be a valid Loom file according to Loom spec version '{lv.version}'")
ValueError: Row attribute 'gene_names' dtype object is not allowed
Column attribute 'cell_names' dtype object is not allowed
For help, see http://linnarssonlab.org/loompy/format/
GSM4878538_umi_hNP_1.loom does not appear to be a valid Loom file according to Loom spec version '0.0.0'
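I don't know why Python can't open it. According to the error message, loompy's validation rejects the 'gene_names' and 'cell_names' attributes because they are stored with dtype object.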
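One possible workaround, which was not tried in these notes, is to skip loompy's validation and read the underlying HDF5 file directly. The sketch below assumes the third-party h5py package is installed and that the file uses the dataset paths reported by the error message and used in the R code (matrix, row_attrs/gene_names, col_attrs/cell_names). The helper names decode_names and read_loom_raw are made up for illustration.

```python
def decode_names(values):
    """Return a list of str from an iterable of bytes or str items.

    HDF5 string datasets may come back as bytes or as str depending on
    the h5py version, so both cases are handled here.
    """
    return [v.decode("utf-8") if isinstance(v, bytes) else str(v) for v in values]


def read_loom_raw(path,
                  gene_key="row_attrs/gene_names",
                  cell_key="col_attrs/cell_names"):
    """Read the matrix, gene names and cell names straight from the HDF5 file.

    The dataset paths are assumptions taken from the error message above.
    """
    import h5py  # third-party; imported lazily so decode_names works without it

    with h5py.File(path, "r") as f:
        # Loads the whole matrix into memory; the Loom spec stores genes
        # in rows and cells in columns.
        matrix = f["matrix"][()]
        genes = decode_names(f[gene_key][()])
        cells = decode_names(f[cell_key][()])
    return matrix, genes, cells


# Example usage (requires the file to be present):
# matrix, genes, cells = read_loom_raw("GSM4878538_umi_hNP_1.loom")
# print(matrix.shape, genes[:10], cells[:5])
```

Whether this reads this particular file cleanly has not been verified here; it only shows the general idea of bypassing the validation step.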
In R:
Install the required R packages:
BiocManager::install("LoomExperiment")
remotes::install_github("aertslab/SCopeLoomR")
remotes::install_github("mojaveazure/loomR") # loomR is loaded below and provides connect()
The R code that finally worked:
library(hdf5r)
library(loomR)
library(LoomExperiment)
library(SCopeLoomR)
conn <- connect("GSM4878538_umi_hNP_1.loom")
Then inspect conn:
> conn
Class: loom
Filename: /xxx/GSM4878538_umi_hNP_1.loom
Access type: H5F_ACC_RDONLY
Attributes: version, chunks
Listing:
       name    obj_type  dataset.dims dataset.type_class
  col_attrs   H5I_GROUP          <NA>               <NA>
 col_graphs   H5I_GROUP          <NA>               <NA>
     layers   H5I_GROUP          <NA>               <NA>
     matrix H5I_DATASET 12385 x 26418          H5T_FLOAT
  row_attrs   H5I_GROUP          <NA>               <NA>
 row_graphs   H5I_GROUP          <NA>               <NA>
Read out the matrix:
> GCMat <- as.data.frame(conn[["matrix"]][,])
> GCMat[1:4,1:4]
V1 V2 V3 V4
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0
4 0 0 0 0
> colnames(GCMat) <- conn[["row_attrs/gene_names"]][]
> rownames(GCMat) <- conn[["col_attrs/cell_names"]][]
> GCMat[1:4,1:5]
AL627309.1 AL627309.6 AL627309.5 AL627309.4
hNP_1_AAACCCACAAAGACTA-1-1 0 0 0 0
hNP_1_AAACCCACACTTCTCG-1-1 0 0 0 0
hNP_1_AAACCCACAGCGTATT-1-1 0 0 0 0
hNP_1_AAACCCACATCTGGGC-1-1 0 0 0 0
FO538757.1
hNP_1_AAACCCACAAAGACTA-1-1 0
hNP_1_AAACCCACACTTCTCG-1-1 0
hNP_1_AAACCCACAGCGTATT-1-1 0
hNP_1_AAACCCACATCTGGGC-1-1 0
Note that the matrix must be transposed so that genes are in rows and cells are in columns:
> GCMat <- t(GCMat)
Convert it to a sparse matrix:
> GCMat <- as(GCMat,"dgCMatrix")
> GCMat[1:4,1:4]
4 x 4 sparse Matrix of class "dgCMatrix"
hNP_1_AAACCCACAAAGACTA-1-1 hNP_1_AAACCCACACTTCTCG-1-1
AL627309.1 . .
AL627309.6 . .
AL627309.5 . .
AL627309.4 . .
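To make it clearer what the dgCMatrix class stores, here is a minimal pure-Python sketch of the compressed sparse column (CSC) layout. It mirrors the three slots of a dgCMatrix: x (non-zero values), i (0-based row indices) and p (column pointers). The function name dense_to_csc and the toy matrix are made up for illustration; this is not how the Matrix package is implemented.

```python
def dense_to_csc(dense):
    """Convert a dense row-major list of lists into CSC arrays (x, i, p)."""
    n_rows = len(dense)
    n_cols = len(dense[0]) if n_rows else 0
    x, i, p = [], [], [0]
    for col in range(n_cols):          # walk the matrix column by column
        for row in range(n_rows):
            value = dense[row][col]
            if value != 0:             # only non-zero entries are stored
                x.append(value)
                i.append(row)
        p.append(len(x))               # where the next column starts in x
    return x, i, p


dense = [
    [0, 0, 3],
    [0, 0, 0],
    [5, 0, 1],
]
x, i, p = dense_to_csc(dense)
print(x, i, p)  # [5, 3, 1] [2, 0, 2] [0, 1, 1, 3]
```

Because UMI count matrices are mostly zeros, as the printed corners above suggest, storing only the non-zero entries this way usually takes far less memory than the dense form.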
Replace the sample prefix at the start of the cell names to make them easier to manage:
AllCells <- colnames(GCMat)
# gsub() is vectorized, so no explicit loop is needed
NewCells <- gsub("hNP_1", "hNP-3", AllCells)
colnames(GCMat) <- NewCells
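For comparison, the same renaming step can be sketched in Python. This version assumes the prefix to replace is hNP_1, as in the cell names printed above, and that hNP-3 is the new label. The function name rename_prefix is made up for illustration. It anchors the match at the start of each name and checks that the renamed IDs are still unique.

```python
import re


def rename_prefix(cell_ids, old_prefix, new_prefix):
    """Replace old_prefix at the start of each cell ID with new_prefix."""
    pattern = re.compile("^" + re.escape(old_prefix))
    # A function is used as the replacement so new_prefix is inserted literally.
    renamed = [pattern.sub(lambda m: new_prefix, cid, count=1) for cid in cell_ids]
    if len(set(renamed)) != len(renamed):
        raise ValueError("renaming produced duplicate cell names")
    return renamed


cells = ["hNP_1_AAACCCACAAAGACTA-1-1", "hNP_1_AAACCCACACTTCTCG-1-1"]
print(rename_prefix(cells, "hNP_1", "hNP-3"))
# ['hNP-3_AAACCCACAAAGACTA-1-1', 'hNP-3_AAACCCACACTTCTCG-1-1']
```

Anchoring at the start avoids touching the same text if it ever appears elsewhere in a cell ID, and the uniqueness check guards against two samples collapsing onto the same names.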