Reading a single-cell transcriptome gene expression matrix from a loom file in R

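This walkthrough uses the GSE160756 dataset as an example: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE160756

Download the data, upload it to the server, and decompress it to obtain the loom file. First, try opening it with Python.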

import loompy
import numpy as np

with loompy.connect("GSM4878538_umi_hNP_1.loom") as ds:
    print(ds.shape)  # matrix dimensions
    print(ds.row_attrs["Gene"][:10])  # first 10 gene names
    print(ds.col_attrs["CellID"][:5])  # first 5 cell IDs

This raises an error:

    with loompy.connect("GSM4878538_umi_hNP_1.loom") as ds:
  File "/usr/local/lib/python3.9/site-packages/loompy/loompy.py", line 1634, in connect
    return LoomConnection(filename, mode, validate=validate)
  File "/usr/local/lib/python3.9/site-packages/loompy/loompy.py", line 86, in __init__
    raise ValueError("\n".join(lv.errors) + f"\n{filename} does not appear to be a valid Loom file according to Loom spec version '{lv.version}'")
ValueError: Row attribute 'gene_names' dtype object is not allowed
Column attribute 'cell_names' dtype object is not allowed
For help, see http://linnarssonlab.org/loompy/format/
GSM4878538_umi_hNP_1.loom does not appear to be a valid Loom file according to Loom spec version '0.0.0'

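I am not sure why this file fails validation in Python. The message only says that the 'gene_names' and 'cell_names' attributes have a disallowed dtype (object).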
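One plausible explanation, based only on the error text and not verified against this file: the file declares no spec version (reported as '0.0.0'), so loompy checks it against the older rules, which reject string attributes stored as variable-length HDF5 strings (dtype object). The traceback shows that loompy.connect accepts a validate argument, so passing validate=False may skip the check; that was not tried here. Because a loom file is an ordinary HDF5 file, another option is to read it directly with h5py. The sketch below assumes the layout suggested by the error message (matrix, row_attrs/gene_names, col_attrs/cell_names). read_loom_raw and _as_str_list are hypothetical helper names, not part of any library.

```python
import h5py


def _as_str_list(dataset):
    """Read a 1-D HDF5 string dataset and return plain Python strings."""
    values = dataset[...]
    # Depending on the h5py version and storage type, elements arrive as bytes or str.
    return [v.decode("utf-8") if isinstance(v, bytes) else str(v) for v in values]


def read_loom_raw(path, gene_attr="gene_names", cell_attr="cell_names"):
    """Read the matrix, gene names and cell names without loompy's validation."""
    with h5py.File(path, "r") as f:
        matrix = f["matrix"][...]  # loads the full dense matrix into memory
        genes = _as_str_list(f["row_attrs"][gene_attr])
        cells = _as_str_list(f["col_attrs"][cell_attr])
    return matrix, genes, cells
```

Calling read_loom_raw("GSM4878538_umi_hNP_1.loom") should return a genes x cells array plus the two name lists if the file follows the usual loom layout. The result was not checked against this dataset.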

R:

Install the required R packages:

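install.packages("hdf5r")  # HDF5 backend, loaded below
remotes::install_github("mojaveazure/loomR")  # provides connect(), used below
BiocManager::install("LoomExperiment")
remotes::install_github("aertslab/SCopeLoomR")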

The R code that finally worked:

library(hdf5r)
library(loomR)
library(LoomExperiment)
library(SCopeLoomR)

# connect() comes from loomR; the file is opened read-only here
conn <- connect("GSM4878538_umi_hNP_1.loom")

Then inspect conn:

> conn
Class: loom
Filename: /xxx/GSM4878538_umi_hNP_1.loom
Access type: H5F_ACC_RDONLY
Attributes: version, chunks
Listing:
       name    obj_type  dataset.dims dataset.type_class
  col_attrs   H5I_GROUP          <NA>               <NA>
 col_graphs   H5I_GROUP          <NA>               <NA>
     layers   H5I_GROUP          <NA>               <NA>
     matrix H5I_DATASET 12385 x 26418          H5T_FLOAT
  row_attrs   H5I_GROUP          <NA>               <NA>
 row_graphs   H5I_GROUP          <NA>               <NA>

Read the matrix. As read here, rows are cells and columns are genes, so the gene names become the column names and the cell names become the row names:

> GCMat <- as.data.frame(conn[["matrix"]][,])
> GCMat[1:4,1:4]
  V1 V2 V3 V4
1  0  0  0  0
2  0  0  0  0
3  0  0  0  0
4  0  0  0  0
> colnames(GCMat) <- conn[["row_attrs/gene_names"]][]
> rownames(GCMat) <- conn[["col_attrs/cell_names"]][]
> GCMat[1:4,1:5]
                           AL627309.1 AL627309.6 AL627309.5 AL627309.4
hNP_1_AAACCCACAAAGACTA-1-1          0          0          0          0
hNP_1_AAACCCACACTTCTCG-1-1          0          0          0          0
hNP_1_AAACCCACAGCGTATT-1-1          0          0          0          0
hNP_1_AAACCCACATCTGGGC-1-1          0          0          0          0
                           FO538757.1
hNP_1_AAACCCACAAAGACTA-1-1          0
hNP_1_AAACCCACACTTCTCG-1-1          0
hNP_1_AAACCCACAGCGTATT-1-1          0
hNP_1_AAACCCACATCTGGGC-1-1          0

Note that the matrix must be transposed so that genes are rows and cells are columns:

> GCMat <- t(GCMat)

Convert it to a sparse matrix:

> GCMat <- as(GCMat,"dgCMatrix")
> GCMat[1:4,1:4]
4 x 4 sparse Matrix of class "dgCMatrix"
           hNP_1_AAACCCACAAAGACTA-1-1 hNP_1_AAACCCACACTTCTCG-1-1
AL627309.1                          .                          .
AL627309.6                          .                          .
AL627309.5                          .                          .
AL627309.4                          .                          .
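Reading the whole matrix with conn[["matrix"]][,] materializes it densely before the sparse conversion, so it helps to estimate the footprint first. The arithmetic below is a rough sketch that assumes 8 bytes per value (R stores numeric data as doubles) and ignores data.frame overhead and the extra copies made by t() and as(). dense_size_gib is a hypothetical helper name.

```python
def dense_size_gib(n_rows, n_cols, bytes_per_value=8):
    """Approximate size of a dense numeric matrix in GiB."""
    return n_rows * n_cols * bytes_per_value / 2**30


# Dimensions reported for the matrix dataset: 12385 x 26418
size = dense_size_gib(12385, 26418)
print(f"{size:.2f} GiB")  # prints "2.44 GiB"
```

On a machine with limited memory, it may be worth reading the matrix in blocks of rows and converting each block to sparse form before combining them.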

Replace the sample prefix at the start of each cell name to make the cells easier to manage:

AllCells <- colnames(GCMat)

# gsub() is vectorized, so no loop is needed.
# The pattern must match the actual prefix; the cell names shown above start with "hNP_1".
NewCells <- gsub("^hNP_1", "hNP-3", AllCells)

colnames(GCMat) <- NewCells
