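Literature Express: Deep Learning for Disease Prognosis -- DeepProg: an ensemble of deep-learning and machine-learning models for prognosis prediction using multi-omics data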

Title

DeepProg: an ensemble of deep-learning and machine-learning models for prognosis prediction using multi-omics data

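Introduction

Most survival-based molecular signatures are derived from a single type of omics data [1]. Because each omics platform has its own limitations and noise, an integrative multi-omics approach can in theory yield more coherent signatures. However, multi-omics studies that predict clinical phenotypes remain far less explored than single-omics ones, owing to a combination of computational and practical challenges. These include platform-specific measurement biases, differing data distributions that require proper normalization, and very limited numbers of samples with multi-omics measurements because of high cost.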

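Among multi-omics data integration methods, most do not model patient survival as the objective; instead, survival differences associated with molecular subtypes are evaluated post hoc. Moreover, many methods are unsupervised and cannot be used to predict the status of a new patient; examples include iCluster, Similarity Network Fusion (SNF), MAUI, and Multi-Omics Factor Analysis (MOFA+). iCluster, one of the earliest methods, uses probabilistic modeling to project the data into a lower-dimensional embedding space and clusters cancer samples into molecular subtypes based on multi-omics features [7]. SNF is another popular clustering method for integrating different omics features: it first builds a separate similarity network for each omics type and then fuses the networks through an iterative procedure. It has been applied to multiple TCGA cancer datasets. MAUI is a nonlinear dimension-reduction framework for multi-omics integration that uses a variational autoencoder to produce latent features, which can then be used for clustering or classification. Similarly, MOFA+ is a statistical framework that uses factor analysis, via standard matrix factorization, to infer latent variables that explain the largest sources of variation in multi-omics datasets.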

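Identifying disease subtypes is clinically important. For example, cancers of a given organ are generally thought to comprise multiple subtypes. Identifying cancer subtypes from molecular signatures allows tumors to be classified beyond stage, grade, or tissue of origin. Cancer subtypes that share similar molecular and pathway alterations can be treated with the same drugs. One kind of subtyping is survival stratification of patients based on prognostic signatures. Once inferred, these signatures can serve as a starting point for follow-up therapeutic or prognostic studies. In addition, molecular differences associated with patient survival help explain the mechanisms of tumor progression. The knowledge gained not only helps improve disease monitoring and management but also informs prevention and treatment.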

Abstract

Multi-omics data are good resources for prognosis and survival prediction; however, they are difficult to integrate computationally. We introduce DeepProg, a novel ensemble framework of deep-learning and machine-learning approaches that robustly predicts patient survival subtypes using multi-omics data. It identifies two optimal survival subtypes in most cancers and yields significantly better risk stratification than other multi-omics integration methods. DeepProg is highly predictive, as exemplified by two liver cancer datasets (C-index 0.73–0.80) and five breast cancer datasets (C-index 0.68–0.73). Pan-cancer analysis associates common genomic signatures in poor survival subtypes with extracellular matrix modeling, immune deregulation, and mitosis processes.
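The C-index values quoted in the abstract measure how often a model ranks pairs of patients in the correct order of survival. The sketch below is a minimal plain-Python version of Harrell's concordance index, included only to make the metric concrete. The function name and the toy inputs are hypothetical and are not taken from DeepProg; pairs with tied survival times are simply skipped, which is a simplification.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C-index: share of comparable patient pairs ranked correctly.

    times:       observed follow-up time for each patient
    events:      1 if the event (e.g. death) was observed, 0 if censored
    risk_scores: higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that patient `a` has the shorter follow-up time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the two times differ and the shorter
        # one ends in an observed event (simplification: ties are skipped).
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0   # shorter survivor was given the higher risk
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (made up for illustration): the third patient is censored.
toy_times = [5, 10, 15, 20]
toy_events = [1, 1, 0, 1]
perfect = concordance_index(toy_times, toy_events, [0.9, 0.7, 0.4, 0.1])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives a scale for reading the reported ranges.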

Results

DeepProg is a general, hybrid, and flexible computational framework to predict patient survival based on one or more omics data types, such as mRNA transcriptomics, DNA methylation, and microRNA expression (Fig. 1).

The first phase of DeepProg is composed of dimension reduction and feature transformation using custom rank normalizations and autoencoders, a type of deep neural network. It uses a "modularized" design for the autoencoders, where each data type is modeled by one autoencoder, to allow flexibility and extensibility to heterogeneous data types. In the default implementation, the autoencoders have 3 layers: the input layer, the hidden layer (100 nodes), and the output layer. The transformed features are then subject to univariate Cox-PH fitting in order to further select a subset of features linked to survival. Next, using an unsupervised clustering approach, DeepProg identifies the optimal number of classes (labels) of survival subpopulations and uses these classes to construct support vector machine (SVM)-based machine-learning models in order to predict a new patient's survival group. To ensure the robustness of the models, DeepProg adopts a boosting approach and builds an ensemble of models. The boosting approach yields more accurate p values and C-indices with lower variances and leads to faster convergence of the models (Additional File 1: Table S1). Each of these models is constructed with a random subset (e.g., 4/5) of the original dataset and evaluated using the C-index value from the remaining hold-out (e.g., 1/5) testing samples. For efficiency, the computation of DeepProg is fully distributed, since each model can be fit separately.
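The resampling scheme described above (each model trained on a random 4/5 of the samples and scored on the remaining 1/5) can be sketched as follows. This is an illustration under stated assumptions, not DeepProg's code: the function names are hypothetical, and combining models by averaging their predicted probabilities is an assumption, since the text only says that the model outputs are combined.

```python
import random


def random_splits(n_samples, n_models, train_fraction=0.8, seed=0):
    """Return one (train_indices, holdout_indices) pair per ensemble member."""
    rng = random.Random(seed)
    n_train = int(round(n_samples * train_fraction))
    splits = []
    for _ in range(n_models):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # First part trains the model, the rest is held out for the C-index.
        splits.append((sorted(indices[:n_train]), sorted(indices[n_train:])))
    return splits


def average_probabilities(per_model_probs):
    """Combine models by averaging the high-risk probability per sample.

    per_model_probs: one list per model, each holding one probability per
    sample. Plain averaging is an assumption made for this sketch.
    """
    n_models = len(per_model_probs)
    n_samples = len(per_model_probs[0])
    return [
        sum(probs[k] for probs in per_model_probs) / n_models
        for k in range(n_samples)
    ]


splits = random_splits(n_samples=10, n_models=3)
combined = average_probabilities([[0.25, 1.0], [0.75, 0.5]])
```

Because each split is independent of the others, the per-model fits can run in parallel, which is the property the text relies on when it calls the computation fully distributed.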

Methods

We obtain the 32 cancer multi-omic datasets from NCBI using the TCGA portal (https://tcga-data.nci.nih.gov/tcga/). We use the package TCGA-Assembler (version 2.0.5) and write custom scripts to download RNA-Seq (UNC IlluminaHiSeq RNASeqV2), miRNA sequencing (BCGSC IlluminaHiSeq, Level 3), and DNA methylation (JHU-USC HumanMethylation450) data from the TCGA website between November 4 and 14, 2017. We also obtain the survival information from the portal: https://portal.gdc.cancer.gov/. We use the same preprocessing steps as detailed in our previous study. We first download RNA-Seq, miRNA-Seq, and methylation data using the functions DownloadRNASeqData, DownloadmiRNASeqData, and DownloadMethylationData from TCGA-Assembler, respectively. Then, we process the data with the functions ProcessRNASeqData, ProcessmiRNASeqData, and ProcessMethylation450Data. In addition, we process the methylation data with the function CalculateSingleValueMethylationData. Finally, for each omic data type, we create a gene-by-sample data matrix in tab-separated values (TSV) format using a custom script.
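The final step, writing one gene-by-sample matrix per omic type as TSV, is done with a custom script that the text does not show. A minimal sketch of what such a step might look like in Python is given below. The input layout, the function name, the "gene" header label, the "NA" placeholder for missing values, and the sample and gene identifiers are all assumptions made for illustration.

```python
import csv
import io


def write_gene_by_sample_tsv(values, handle):
    """Write a gene-by-sample matrix as tab-separated values.

    values: dict mapping sample ID -> dict mapping gene -> measurement
    handle: any writable text file object
    """
    samples = sorted(values)
    genes = sorted({gene for per_sample in values.values() for gene in per_sample})
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # Header row: one column per sample.
    writer.writerow(["gene"] + samples)
    for gene in genes:
        # Missing measurements are written as "NA" (an assumption of this sketch).
        writer.writerow([gene] + [values[s].get(gene, "NA") for s in samples])


# Hypothetical toy input; real identifiers would come from the processed data.
toy_values = {
    "sample_A": {"GENE1": 3.1, "GENE2": 5.2},
    "sample_B": {"GENE2": 4.8},
}
buffer = io.StringIO()
write_gene_by_sample_tsv(toy_values, buffer)
tsv_text = buffer.getvalue()
```

Writing to a file instead only requires passing a handle opened with `open(path, "w", newline="")`.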

Conclusions

DeepProg is a novel ensemble framework of deep-learning and machine-learning approaches that robustly predicts patient survival subtypes using multi-omics data. We anticipate that DeepProg models will be informative for predicting patient survival risks in diseases such as cancers.

Figures

Fig. 1 The computational framework of DeepProg. DeepProg uses the boosting strategy to build several models, each from a random subset of the dataset. For each model, each omic data matrix is normalized and then transformed using an autoencoder. Each of the new hidden-layer features in the autoencoder is then tested for association with survival using univariate Cox-PH models. The features significantly associated with survival are then subject to clustering (Gaussian clustering by default). Upon determining the optimal cluster, the top features in each omic input data type are selected through Kruskal-Wallis analysis (default threshold = 0.05). Finally, these top omics features are used to construct a support vector machine (SVM) classifier and to predict the survival risk group of a new sample. DeepProg combines the outputs of all the classifier models to produce more robust results.
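The Kruskal-Wallis selection step in the caption can be illustrated for the common case of two survival subtypes. The sketch below computes the tie-corrected H statistic in plain Python and, because two groups give one degree of freedom, obtains the p value from the chi-square survival function written as `erfc(sqrt(H/2))`. Function names and toy values are hypothetical, and restricting the test to two groups is a simplification of this sketch, not a statement about DeepProg's implementation.

```python
import math


def average_ranks(values):
    """Rank values from 1 to N, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in order[start:end + 1]:
            ranks[k] = mean_rank
        start = end + 1
    return ranks


def kruskal_wallis_two_groups(group_a, group_b):
    """Return (H, p) for a two-group Kruskal-Wallis test with tie correction."""
    pooled = list(group_a) + list(group_b)
    n = len(pooled)
    ranks = average_ranks(pooled)
    rank_sum_a = sum(ranks[:len(group_a)])
    rank_sum_b = sum(ranks[len(group_a):])
    h = 12.0 / (n * (n + 1)) * (
        rank_sum_a ** 2 / len(group_a) + rank_sum_b ** 2 / len(group_b)
    ) - 3 * (n + 1)
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    correction = 1 - sum(c ** 3 - c for c in counts.values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all values are identical")
    h = max(h / correction, 0.0)
    # Chi-square survival function with 1 degree of freedom.
    return h, math.erfc(math.sqrt(h / 2))


def select_features(features, labels, threshold=0.05):
    """Keep feature names whose values differ between subtype 0 and subtype 1."""
    kept = []
    for name, values in features.items():
        a = [v for v, lab in zip(values, labels) if lab == 0]
        b = [v for v, lab in zip(values, labels) if lab == 1]
        if kruskal_wallis_two_groups(a, b)[1] < threshold:
            kept.append(name)
    return kept


toy_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
toy_features = {
    "separating": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "uninformative": [1, 6, 2, 7, 3, 3, 7, 2, 6, 1],
}
selected = select_features(toy_features, toy_labels)
```

With more than two subtypes the same H statistic applies, but the p value needs a chi-square distribution with more degrees of freedom, which a statistics library would normally supply.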

Fig. 2 DeepProg performance for the 32 TCGA cancer datasets. A Kaplan-Meier plots for each cancer type, where the survival risk group stratification is determined by DeepProg. B The density distributions of -log10 (log-rank p value) for the Cox-PH models based on the subtypes determined by DeepProg (light grey line), SNF (dark grey line), or the pair-wise -log10 (log-rank p value) differences between DeepProg and SNF (blue line). C Smoothed C-index distributions for the Cox-PH models based on the subtypes determined by DeepProg (light grey line), SNF (dark grey line), or the pair-wise C-index difference between DeepProg and SNF (blue line).
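Panel B summarizes each stratification by -log10 of a log-rank p value. To make that quantity concrete, the sketch below implements a standard two-group log-rank test in plain Python and converts the p value to the -log10 scale. It is an illustration only: the function names and toy data are hypothetical, and it is not a reproduction of how the figure was computed.

```python
import math


def logrank_two_groups(times, events, groups):
    """Two-group log-rank test; returns (chi-square statistic, p value).

    times:  follow-up time per patient
    events: 1 if the event was observed, 0 if censored
    groups: 1 for the high-risk subtype, 0 for the other subtype
    """
    observed_minus_expected = 0.0
    variance = 0.0
    event_times = sorted({t for t, e in zip(times, events) if e})
    for t in event_times:
        # Patients still under observation just before time t.
        at_risk = [g for ti, g in zip(times, groups) if ti >= t]
        n = len(at_risk)
        n1 = sum(1 for g in at_risk if g == 1)
        d = sum(1 for ti, e in zip(times, events) if e and ti == t)
        d1 = sum(
            1 for ti, e, g in zip(times, events, groups) if e and ti == t and g == 1
        )
        # Observed events in group 1 minus their expectation under no difference.
        observed_minus_expected += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        raise ValueError("log-rank statistic is undefined for these data")
    chi2 = observed_minus_expected ** 2 / variance
    # Chi-square survival function with 1 degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2))


def neg_log10(p_value):
    """Transform a p value to the -log10 scale used in panel B."""
    return -math.log10(p_value)


# Toy cohort (made up): the high-risk group has all the early events.
toy_times = [1, 2, 3, 4, 5, 11, 12, 13, 14, 15]
toy_events = [1] * 10
toy_groups = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
toy_chi2, toy_p = logrank_two_groups(toy_times, toy_events, toy_groups)
toy_score = neg_log10(toy_p)
```

On this scale larger values mean stronger separation between the survival curves, so a density shifted to the right indicates better stratification.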

Fig. 3 Comparing the performance of DeepProg and its variations, where the default autoencoder is substituted by a simple PCA decomposition or the MOFA+ method to generate an input matrix of the same dimensions, using TCGA HCC (A–C) and BRCA (D–F) datasets. Methods in comparison: A, D PCA; B, E MOFA+; C, F DeepProg default.

Fig. 4 Validation of DeepProg subtype predictions by independent breast cancer and liver cancer cohorts. RNA-Seq validation datasets for HCC: A LIRI (n = 230) and B GSE (n = 221); validation datasets for BRCA: C Patiwan (n = 159), D Metabric (n = 1981), E Anna (n = 249), and F Miller (n = 236).

Fig. 5 Pan-cancer analysis of RNA-Seq gene signatures in the worst survival vs. other groups. A Top 100 over- and under-expressed genes for RNA, MIR, and METH omics ranked by survival predictive power. The colors correspond to the ranks of the genes based on their -log10 (log-rank p value) of the univariate Cox-PH model. Based on these scores, the 32 cancers and the features are clustered using the Ward method. B Co-expression network constructed with the top 200 differentially expressed genes from the 32 cancers. The 200 genes are clustered from the network topology with the Louvain algorithm. For each submodule, we identified the most significantly enriched pathway, as shown on the figure. C The expression values of the 200 genes used to construct the co-expression network. A clustering of the cancers using these features with the Ward method is represented on the x-axis.

Fig. 6 Transfer learning to predict survival subtypes of certain cancers using DeepProg models trained on different cancers. A Heatmap of the Cox-PH log-rank p values for the subtypes inferred using each cancer as the training dataset. B Kaplan-Meier plot of predicted subtypes for COAD, using the DeepProg model trained on STAD. C Kaplan-Meier plot of predicted subtypes for STAD, using the DeepProg model trained on COAD.