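Literature Digest: Early Diagnosis of Lung Cancer: A 3D Probabilistic Deep Learning System for Detection and Diagnosis of Lung Cancer Using Low-Dose CT Scans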

Title

A 3D Probabilistic Deep Learning System for Detection and Diagnosis of Lung Cancer Using Low-Dose CT Scans

01 Introduction

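Lung cancer is both one of the most common cancers and one of the leading causes of cancer death, accounting for about one quarter of all cancer-related deaths in the United States. Its high mortality is partly because symptoms become apparent only once the cancer has reached an advanced stage. Low-dose computed tomography (CT) has been proposed as a safe and effective tool for preventive screening of high-risk populations. Compared with annual chest radiography, annual CT screening can reduce lung cancer mortality by at least 20%, with the effect visible after 7 years.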

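Although lung CT screening has the potential to substantially reduce lung cancer deaths, making screening accurate and efficient places a heavy burden on radiologists. Automated algorithmic solutions may help relieve this burden, but the interface between such algorithms and physicians becomes a challenge when the algorithms cannot reliably communicate their uncertainty. To address these needs, we introduce an end-to-end probabilistic diagnostic system for lung cancer, built on deep 3D convolutional neural networks (CNNs). Our system analyzes CT scans directly and provides calibrated probability scores that accurately characterize uncertainty.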

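Our system consists of two main components: 1) a computer-aided detection (CADe) module that detects and segments suspicious lung nodules, and 2) a computer-aided diagnosis (CADx) module that analyzes the suspicious lesions from CADe to perform nodule-level assessment and patient-level malignancy classification. On benchmarks such as LUNA16 and the Kaggle Data Science Bowl 2017, our CADe and CADx modules perform on par with or better than the best published CADe and CADx systems, even though our system is trained only on the limited data and labels provided with these datasets, unlike other studies that used additional training data.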

Abstract

We introduce a new computer-aided detection and diagnosis system for lung cancer screening with low-dose CT scans that produces meaningful probability assessments. Our system is based entirely on 3D convolutional neural networks and achieves state-of-the-art performance for both lung nodule detection and malignancy classification tasks on the publicly available LUNA16 and Kaggle Data Science Bowl challenges. While nodule detection systems are typically designed and optimized on their own, we find that it is important to consider the coupling between detection and diagnosis components. Exploiting this coupling allows us to develop an end-to-end system that has higher and more robust performance and eliminates the need for a nodule detection false positive reduction stage. Furthermore, we characterize model uncertainty in our deep learning systems, a first for lung CT analysis, and show that we can use this to provide well-calibrated classification probabilities for both nodule detection and patient malignancy diagnosis. These calibrated probabilities informed by model uncertainty can be used for subsequent risk-based decision making towards diagnostic interventions or disease treatments, as we demonstrate using a probability-based patient referral strategy to further improve our results.

Conclusions

In this paper, we introduced a full CADe/CADx system to detect and diagnose lung cancer using low-dose CT scans. Our system uses a cascade of 3D CNNs and achieves state-of-the-art performance on both lung nodule detection and malignancy classification problems on the publicly available LUNA16 and Kaggle datasets. Moreover, we characterized model uncertainty using Monte Carlo dropout and deep ensembles, and showed that quantification of model uncertainty enables our system to provide calibrated classification probabilities, which makes it reliable for subsequent utility/risk-based decision making towards diagnostic interventions or disease treatments. We demonstrated that we can further improve the performance by using these calibrated probabilities to make patient referral decisions.

Our CADe/CADx system studies demonstrate that CADe and CADx modules should be developed and studied jointly if the goal is to use them as an end-to-end automated diagnostic system.
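As a rough illustration of how Monte Carlo dropout or a deep ensemble can yield a probability together with an uncertainty estimate, the sketch below averages several stochastic predictions for one case and computes the entropy of the averaged probability. The function names and the sample values are hypothetical and are not taken from the paper; the paper's actual networks and sampling setup are not reproduced here.

```python
import math

def mean_probability(samples):
    """Average the probabilities from several stochastic passes or ensemble members."""
    return sum(samples) / len(samples)

def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli(p) prediction; defined as 0 at p = 0 or p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

# Hypothetical outputs of 5 dropout passes (or 5 ensemble members) for two cases.
confident_case = [0.97, 0.95, 0.98, 0.96, 0.94]
uncertain_case = [0.20, 0.75, 0.55, 0.35, 0.65]

p_confident = mean_probability(confident_case)
p_uncertain = mean_probability(uncertain_case)
h_confident = binary_entropy(p_confident)
h_uncertain = binary_entropy(p_uncertain)

print(f"confident: p={p_confident:.2f}, entropy={h_confident:.3f}")
print(f"uncertain: p={p_uncertain:.2f}, entropy={h_uncertain:.3f}")
```

When the sampled predictions disagree, the averaged probability moves toward 0.5 and its entropy rises, which is the kind of signal a referral rule can act on.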

Results

We evaluate our CADe system on the LUNA16 benchmark and our CADx system on the Kaggle Stage-2 test set. Since CADx directly relies on CADe, the success of the CADx system acts as additional validation of the CADe system and its ability to generalize to an independent dataset. Likewise, the CADx system is trained and validated on the Kaggle Stage-1 dataset but tested on the Kaggle Stage-2 dataset, which is more recent and has different image quality.

Figure

Fig. 1. Our overall CAD system diagram. Since CADx performance is so reliant on the quality of the nodule candidates generated by the CADe, both were developed simultaneously to achieve the best overall performance.

Fig. 2. Randomly sampled augmentations of a single nodule, demonstrating the extensive augmentation transforms used during model training.

Fig. 3. Sample nodule segmentations from our CADe segmentation model, sliced through the center of each nodule candidate. First row: Input CT scan images from LIDC-IDRI test data. Second row: Our corresponding segmentation probabilities. Third row: (Spherical) voxelwise labels extracted from the LUNA16 annotations.

Fig. 4. Base neural network architecture used for nodule candidate scoring, malignancy ranking, and multiple-instance malignancy classification. The architecture hyperparameters were found through experimentation on the two CADx tasks, namely malignancy ranking and classification.

Fig. 5. Example nodule candidates with CADx malignancy probabilities along with corresponding candidate attention weights. Each row represents candidates from a specific patient. The scores on top of each candidate are the corresponding CADx network attention weights, which sum to 1 and represent how much each candidate contributes relatively to the final CADx score. The estimated patient-level CADx malignancy probability is given at the bottom of each row.
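The Fig. 5 caption states only that the attention weights sum to 1 and express each candidate's relative contribution. One common way to obtain such weights is a softmax over per-candidate scores, followed by a weighted sum; the sketch below shows that pattern. The softmax normalization, the weighted-sum pooling, and all names and numbers here are assumptions for illustration, not the paper's confirmed mechanism.

```python
import math

def softmax(scores):
    """Turn arbitrary real scores into non-negative weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(attention_scores, candidate_values):
    """Weight each candidate's value by its attention weight and sum the result."""
    weights = softmax(attention_scores)
    pooled = sum(w * v for w, v in zip(weights, candidate_values))
    return weights, pooled

# Hypothetical attention scores and per-candidate values for one patient.
scores = [2.0, 0.5, -1.0]
values = [0.9, 0.3, 0.1]
weights, pooled = attention_pool(scores, values)

print([round(w, 3) for w in weights], round(pooled, 3))
```

Because the weights form a convex combination, the pooled value always lies between the smallest and largest candidate value, and the candidate with the highest score dominates it.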

Fig. 6. The free response operating characteristic (FROC) for our CADe candidate generation and scoring system on the LUNA16 dataset, with patient-bootstrapped 95% confidence interval, and the same results with our comparison model architecture (3D U-Net).

Fig. 7. CADe FROC on LUNA16 data, breaking down the sensitivity by nodule diameter. CADe sensitivity for the smallest group of nodules (between 3 mm and 5 mm diameter) is significantly worse than the sensitivity for larger nodules (between 5 mm and 30 mm diameter) at lower thresholds that correspond to reduced false positives.

Fig. 8. The receiver operating characteristic (ROC) curve for our CADx system on the Kaggle Stage-2 test set, trained on both LUNA16 and Kaggle Stage-1 data, with patient-bootstrapped 95% confidence interval.
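A patient-bootstrapped confidence interval can be sketched as follows: resample patients with replacement, recompute the metric on each resample, and take percentiles of the resulting values. The code below does this for the area under the ROC curve using a percentile bootstrap. The percentile method, the number of resamples, and the toy labels and scores are assumptions; the captions do not specify the exact bootstrap procedure.

```python
import random

def auc(labels, scores):
    """Area under the ROC curve by pairwise comparison; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined when a class is missing
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling patients with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        value = auc([labels[i] for i in idx], [scores[i] for i in idx])
        if value is not None:  # skip resamples that contain a single class
            stats.append(value)
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper

# Hypothetical data: one label and one predicted score per patient.
labels = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9, 0.05, 0.55]

point = auc(labels, scores)
lower, upper = bootstrap_ci(labels, scores)
print(f"AUC = {point:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

Resampling whole patients rather than individual nodules keeps correlated findings from the same patient together, which is presumably why the captions specify a patient-level bootstrap.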

Fig. 9. Probability calibration curves for the Bayesian approximation and non-Bayesian (standard) variants of the CADe candidate scoring neural networks, with patient-bootstrapped 95% confidence intervals. The estimated nodule probabilities output by the CADe scoring network are well-calibrated when model uncertainty is included.

Fig. 10. Probability calibration curves for the Bayesian ensemble and the best non-Bayesian CADx model, with patient-bootstrapped 95% confidence intervals. The estimated malignancy probabilities output by the CADx Bayesian ensemble are well-calibrated when model uncertainty is included.
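A probability calibration curve of the kind shown in Figs. 9 and 10 is typically built by grouping predictions into probability bins and comparing the mean predicted probability in each bin with the observed fraction of positives. The sketch below uses equal-width bins; the bin count, function name, and toy data are assumptions, since the captions do not state the binning used.

```python
def calibration_curve(labels, probs, n_bins=5):
    """Return (mean predicted probability, observed positive fraction, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls into the last bin
        bins[index].append((y, p))
    curve = []
    for members in bins:
        if members:
            mean_p = sum(p for _, p in members) / len(members)
            frac_pos = sum(y for y, _ in members) / len(members)
            curve.append((mean_p, frac_pos, len(members)))
    return curve

# Hypothetical predictions and ground-truth labels.
probs = [0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.5, 0.5]
labels = [0, 0, 0, 1, 1, 1, 1, 0, 1, 0]

curve = calibration_curve(labels, probs)
for mean_p, frac_pos, count in curve:
    print(f"predicted {mean_p:.2f} -> observed {frac_pos:.2f} (n={count})")
```

A perfectly calibrated model would have every point on the diagonal, with the observed fraction equal to the mean predicted probability in each bin.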

Fig. 11. The area under the precision-recall curve for CADe scoring as a function of referral percentage, for the entropy referral strategy and a random strategy (with 95% confidence interval).

Fig. 12. The CADx area under the ROC curve as a function of referral percentage, for the entropy referral strategy and a random strategy, with 95% confidence interval. Note that the confidence interval is wider than for the CADe scoring network because the number of patients in the Kaggle test set is smaller than the number of nodules in the LUNA16 dataset.
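The entropy referral strategy in Figs. 11 and 12 can be read as: rank cases by the entropy of their predicted probability, refer the most uncertain fraction to a human reader, and evaluate the model on the remainder. The sketch below follows that reading. It reports accuracy for brevity rather than the area-under-curve metrics used in the figures, and the helper names, threshold, and data are hypothetical.

```python
import math

def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli(p) prediction; defined as 0 at p = 0 or p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def refer_most_uncertain(probs, referral_fraction):
    """Split case indices into (kept, referred), referring the highest-entropy cases."""
    order = sorted(range(len(probs)), key=lambda i: binary_entropy(probs[i]), reverse=True)
    n_refer = int(round(referral_fraction * len(probs)))
    return sorted(order[n_refer:]), sorted(order[:n_refer])

def accuracy(labels, probs, indices, threshold=0.5):
    """Fraction of the selected cases whose thresholded prediction matches the label."""
    correct = sum((probs[i] >= threshold) == bool(labels[i]) for i in indices)
    return correct / len(indices)

# Hypothetical calibrated probabilities and ground-truth labels.
probs = [0.02, 0.95, 0.45, 0.60, 0.10, 0.88, 0.52, 0.99]
labels = [0, 1, 1, 0, 0, 1, 0, 1]

kept, referred = refer_most_uncertain(probs, referral_fraction=0.25)
acc_all = accuracy(labels, probs, range(len(probs)))
acc_kept = accuracy(labels, probs, kept)
print(f"referred cases: {referred}")
print(f"accuracy on all cases: {acc_all:.3f}, after referral: {acc_kept:.3f}")
```

In this toy example the two referred cases are the ones closest to 0.5, and both were misclassified, so performance on the retained cases improves. A random referral strategy would not show that gain on average.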

Fig. 13. Histogram of CADx malignancy probability estimates for benign and malignant patients. Note that the mode around 0 for malignant patients corresponds to false negatives caused by nodules missed by the CADe model.

Table

TABLE I. CADe and CADx results by CADe threshold false positive rate

TABLE II. CADx results by CADe threshold for training and testing