今天我们将下载10x官方人肺癌FFPE样本Xenium5k下机数据,使用python的spatialdata库,演示如何进行Xenium单个样本/多样本数据读取,以及简单绘图功能展示。
1. 示例数据下载:
数据下载地址: https://www.10xgenomics.com/datasets/xenium-human-lung-cancer-post-xenium-technote

文件解压缩后,文件层级展示如下图,一般一个FOV对应的是一个样本,一个样本对应一个文件夹结果,一张芯片上最多可以选8个FOV,若果一张芯片上拼的样本数超过8个,就会有多个样本被并到一个FOV中,后续下机数据分析的时候想要拆分开的话,需要使用Xenium browser手动圈选,拿到个样本的barcodes,然后就可以拆分样本(一般TMA样本都需要手动圈选操作)。这里我们下载的数据就只有一个FOV,也就是只有一个样本,所以下图展示的是这一个样本的数据。

各关键文件说明如下,10x官方给出了很详细的说明:https://www.10xgenomics.com/cn/support/software/xenium-onboard-analysis/latest/analysis/xoa-output-understanding-outputs
File type | File and description |
---|---|
Experiment file | experiment.xenium : Experiment manifest file. |
Interactive summary | analysis_summary.html : Summary metrics, graphs, and images to QC your run data in HTML format. |
Image files | morphology.ome.tif : The 3D nuclei-stained (DAPI) morphology image in OME-TIFF format. |
Image files | morphology_focus/ : A directory containing the multi-focus projection of morphology image(s) in a multi-file OME-TIFF format (2D). The directory will contain the nuclei DAPI stain image, as well as three additional stain images for Xenium outputs generated with the multimodal cell segmentation assay workflow. |
Cell summary | cells.csv.gz : Cell summary file. |
Cell summary | cells.parquet : Cell summary file in Parquet format. |
Cell segmentation masks and polygons | cells.zarr.zip : Cell summary file in zipped Zarr format, only file that contains the nucleus and cell segmentation masks and boundaries used for transcript assignment. |
Cell boundary polygons | cell_boundaries.csv.gz : Cell boundary file. |
cell_boundaries.parquet : Cell boundary file in Parquet format. |
|
Nucleus boundary polygons | nucleus_boundaries.csv.gz : Nucleus boundary file. |
nucleus_boundaries.parquet : Nucleus boundary file in Parquet format. |
|
Transcript data | transcripts.parquet : Transcripts data in Parquet format. |
transcripts.zarr.zip : Transcript data in zipped Zarr format. |
|
Cell-feature matrix | cell_feature_matrix/ : Directory of the cell-feature matrix files in Market Exchange format. |
cell_feature_matrix.h5 : Cell-feature matrix file in HDF5 format. |
|
cell_feature_matrix.zarr.zip : Cell-feature matrix file in zipped Zarr format. |
|
Metric summary | metrics_summary.csv : Summary of key metrics. |
Secondary analysis | analysis/ : Directory of secondary analysis results. |
analysis.zarr.zip : Secondary analysis outputs in zipped Zarr format. |
|
Gene panel | gene_panel.json : Copy of input gene panel file. |
Auxiliary data (aux_outputs/ ) |
* morphology_fov_locations.json : Field of view (FOV) name and position information (in microns). * overview_scan_fov_locations.json : FOV name and position information (in pixels). * per_cycle_channel_images/ : Directory of downsampled RNA image files in TIFF format from each cycle and channel. * overview_scan.png : Full resolution image of entire slide sample. * background_qc_images/ : Directory of autofluorescence images (downsampled, TIFF format) that are subtracted from the raw stain images to produce the morphology_focus/ images if Cell Segmentation Staining protocol used. |
analysis_summary.html对下机数据有个整体了解

下面展示的是细胞分割依据,有15.8%的细胞是根据细胞膜染色帮助识别分割细胞(这部分细胞分割最准确),有82.0%的细胞是通过针对细胞内18S核糖体RNA染色方法来标记并分割细胞,有2.2%的细胞通过DAPI识别出细胞核后,向外扩5um认为是细胞边界(这些细胞分割最不准确)


下面展示的是morphology_focus文件夹下的4个ome.tif文件,对应的是4个通道,0000是DAPI, 0001是green, 0002是yellow, 0003是red。

2. 安装依赖库

3. 数据读取
由于Xenium下机数据较大,多个样本按顺序读取耗时较长,这里我们使用多线程并行读取,缩短时间。
data_dir参数是xenium下机数据文件位置;
sample_info参数是样本信息.txt文件,一共三列,第一列是下机数据问价夹名称,第二列是样本名称,第三列是样本分组名称,使用'\t'分隔,有多个样本的,文件中就有多行。
import osimport threadingimport spatialdata as sdfrom spatialdata_io import xenium
# 多线程读取Xenium下机数据读取def xenium_data_load_multithreaded(data_dir, sample_info): def sd_read_xenium(sample_data, sample_name, sdata_dict): sdata = xenium(path=sample_data, cells_boundaries=True, n_jobs=6) sdata_dict[sample_name] = sdata threads = [] sdata_dict = {} sample_2_group = {} with open(sample_info, 'r') as f: for line in f: raw_name, sample_name, group_name = line.strip().split('\t')[:3] # 这里根据自己实际情况修改 sample_2_group[sample_name] = group_name thread = threading.Thread(target=sd_read_xenium, args=(os.path.join(data_dir, raw_name),sample_name, sdata_dict,)) threads.append(thread) thread.start() for thread in threads: thread.join() sdata = sd.concatenate( sdata_dict, concatenate_tables=True, # 这里是将多样本的单细胞数据合并在一起到table中 obs_names_make_unique=True ) sdata.tables['table'].obs["sample"] = sdata.tables['table'].obs["region"].str.replace('cell_circles-', '') sdata.tables['table'].obs["group"] = sdata.tables['table'].obs["sample"].apply(lambda x: sample_2_group[x]) sdata.tables['table'].obs["cell_boundaries"] = sdata.tables['table'].obs["region"].str.replace('cell_circles', 'cell_boundaries') sdata.set_table_annotates_spatialelement(table_name='table', region=[i for i in sdata.shapes.keys() if i.startswith('cell_boundaries-')], region_key='cell_boundaries')
return sdata
简单绘图展示
fig, ax = plt.subplots(figsize=(10, 10))sdata.pl.render_images("morphology_focus-S1").pl.show(ax=ax, title="Morphology plot", coordinate_systems="global")
ax.grid(False)

基因表达
from spatialdata import bounding_box_query
fig, ax = plt.subplots(figsize=(10, 10))crop0 = lambda x: bounding_box_query( x, min_coordinate=[10000, 20000], max_coordinate=[15000, 25000], axes=("x", "y"), target_coordinate_system="global",)crop0(sdata).pl.render_shapes( "cell_boundaries-S1", color='EPCAM', outline_width=0.3, outline_alpha=0.9, outline_color='grey').pl.show(ax=ax, title="EPCAM gene expression", coordinate_systems="global")ax.grid(False)ax.axis('off')
