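The ULS23 Challenge: A Baseline Model and Benchmark Dataset for 3D Universal Lesion Segmentation in Computed Tomography | Literature Express: Latest Deep Learning Medical AI Papers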

Title

The ULS23 challenge: A baseline model and benchmark dataset for 3D universal lesion segmentation in computed tomography

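Introduction

The number of CT examinations performed each year continues to grow (Masjedi et al., 2020), which increases the workload of radiologists (McDonald et al., 2015). The global cancer burden is projected to rise by 47% in 2040 relative to 2020 (Sung et al., 2021), and oncological radiology is expected to add substantially to these growing demands. Cancer patients typically undergo multiple imaging examinations during treatment and subsequent disease monitoring (Rehani et al., 2020). Interest in early cancer detection through imaging is also rising (Crosby et al., 2022; Adams et al., 2023). Computer assistance during reading may help radiologists cope with this growing workload.

With minimal guidance from the radiologist, automatic segmentation models can reduce the time burden and inter-observer variability associated with manually annotating lesions in oncological scans. A lesion can be selected for segmentation with a single click (Tang et al., 2020), a bounding-box annotation (Mazurowski et al., 2023; Ma et al., 2024), or a detection model (Yan et al., 2019). Longitudinal measurements derived from the segmented lesions can be analyzed according to clinical guidelines such as the Response Evaluation Criteria in Solid Tumors (RECIST) (Eisenhauer et al., 2009; Schwartz et al., 2016). Automatic 3D segmentation also enables more complex analyses, such as distinguishing lesion subtypes using radiomics features (Gillies et al., 2016). In addition, registration algorithms can propagate segmented lesions to follow-up scans, saving considerable time in subsequent examinations (Hering et al., 2024).

Over the past decade, AI-driven automatic tumor segmentation models have made significant progress on specific high-interest lesion types, such as those in the liver (Bilic et al., 2023), kidney (Heller et al., 2023), or lung (Pedrosa et al., 2021). Segmentation models covering a broader range of lesion types, particularly in the frequently examined thorax-abdomen region, remain comparatively understudied. Developing such universal lesion segmentation (ULS) models requires diverse training datasets. Although ULS has been studied before (see Section 2.4), most of that work relies heavily on the DeepLesion dataset. That dataset contains only lesion diameter annotations on a single axial slice, which makes it poorly suited to developing 3D segmentation models. Moreover, the ground-truth segmentation masks that earlier studies (Cai et al., 2018; Tang et al., 2020) used for evaluation on this dataset were not made public, which limits reproducibility. ULS models are also rarely released publicly, which hinders their integration into researchers' annotation workflows and their use in further clinical evaluation.

### Research contributions

In light of this, we launched the ULS23 challenge. Its contributions are:

- Advancing model performance by collecting a large and diverse training dataset. We introduce two new datasets for pancreatic and bone lesions, which are traditionally difficult to segment, and consolidate 10 public datasets with a lesion segmentation component into a single, easily accessible data repository.
- Improving the reproducibility of ULS research by establishing a reliable benchmark with a carefully curated test set of clinically relevant lesions from two Dutch medical centers.
- Giving the research community access to a state-of-the-art ULS model by developing and publicly releasing our baseline semi-supervised ULS model.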

Abstract

Size measurements of tumor manifestations on follow-up CT examinations are crucial for evaluating treatment outcomes in cancer patients. Efficient lesion segmentation can speed up these radiological workflows. While numerous benchmarks and challenges address lesion segmentation in specific organs like the liver, kidneys, and lungs, the larger variety of lesion types encountered in clinical practice demands a more universal approach. To address this gap, we introduced the ULS23 benchmark for 3D universal lesion segmentation in chest-abdomen-pelvis CT examinations. The ULS23 training dataset contains 38,693 lesions across this region, including challenging pancreatic, colon and bone lesions. For evaluation purposes, we curated a dataset comprising 775 lesions from 284 patients. Each of these lesions was identified as a target lesion in a clinical context, ensuring diversity and clinical relevance within this dataset. The ULS23 benchmark is publicly accessible at https://uls23.grand-challenge.org, enabling researchers worldwide to assess the performance of their segmentation methods. Furthermore, we have developed and publicly released our baseline semi-supervised 3D lesion segmentation model. This model achieved an average Dice coefficient of 0.703 ± 0.240 on the challenge test set. We invite ongoing submissions to advance the development of future ULS models.

Method

In conjunction with the ULS23 challenge, we developed a baseline model using the challenge dataset and the LNDb data. To assist participants in preparing their algorithms for the challenge infrastructure, we released the model weights, training code and algorithm container. Additionally, the algorithm can be accessed on the Grand Challenge platform, where users can upload their own data to be segmented.

[...] could erroneously yield a low measurement error. Finally, a subset of lesions is included multiple times during evaluation of the validation and test set, using randomly sampled lesion foreground voxels as the center locations. This results in slight variations in the scan context for each cropped volume of interest (VOI). We check whether the model outputs similar predictions for these different click locations by comparing the Sørensen–Dice coefficient of the re-aligned segmentation masks. To encourage model robustness to variations in click-point location, a 10% weight is assigned to this segmentation consistency score (SCS). While the SCS is secondary to segmentation performance, when two models perform similarly, the more robust model that maintains performance across different input variations should be preferred.
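The consistency check described above can be illustrated with a minimal Python sketch. It is not the challenge's evaluation code: the function names `dice`, `realign` and `consistency_score` are hypothetical, masks are represented as sets of voxel coordinates for simplicity, and averaging the Dice over all prediction pairs of a lesion is an assumption (the text only states that re-aligned masks are compared with the Sørensen–Dice coefficient).

```python
from itertools import combinations

Voxel = tuple[int, int, int]


def dice(a: set[Voxel], b: set[Voxel]) -> float:
    """Sørensen–Dice coefficient between two binary masks given as voxel sets."""
    if not a and not b:
        return 1.0  # assumption: two empty masks count as perfect agreement
    return 2.0 * len(a & b) / (len(a) + len(b))


def realign(mask: set[Voxel], click: Voxel) -> set[Voxel]:
    """Shift a mask expressed relative to its click point back to scan coordinates."""
    cz, cy, cx = click
    return {(z + cz, y + cy, x + cx) for z, y, x in mask}


def consistency_score(predictions: list[tuple[Voxel, set[Voxel]]]) -> float:
    """Mean pairwise Dice over re-aligned predictions of one lesion.

    Each entry is (click point in scan coordinates, predicted mask relative
    to that click point). Pairwise averaging is an illustrative assumption.
    """
    aligned = [realign(mask, click) for click, mask in predictions]
    pairs = list(combinations(aligned, 2))
    if not pairs:
        raise ValueError("need at least two predictions of the same lesion")
    return sum(dice(a, b) for a, b in pairs) / len(pairs)
```

A model that segments the same voxels regardless of where the click lands scores 1.0; any disagreement between the re-aligned masks lowers the score.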

Conclusion

This paper presents the ULS23 challenge, establishing the first public benchmark for the evaluation of 3D universal lesion segmentation models on computed tomography scans. We introduce novel training data for bone and pancreas lesions, for which only limited public data was previously available. The challenge training dataset features a unique combination of fully- and partially-annotated data. To demonstrate the potential of this combined dataset, we developed a strategy for predicting 3D pseudo-masks from the partially-annotated 2D data, allowing for their inclusion in 3D model development. Using this approach, we iteratively trained a semi-supervised ULS model that leverages the entire training dataset. For evaluation purposes, we assembled a high-quality and diverse test set of lesions that were selected for RECIST measurement in clinical practice. By focusing on clinically relevant target lesions, our benchmark is tightly integrated with the practical requirements of radiologists. Our scaled-up, semi-supervised model achieves a Dice score of 0.703 ± 0.240 on this test set, compared to a Dice score of 0.651 ± 0.253 for a standard, automatically-configured nnUnet. The model weights, data processing code and evaluation scripts are publicly released to ensure transparency and reproducibility. Future work will include a meta-analysis of the methods developed by challenge participants and an assessment of how these models could reduce reading times for oncological scans. Subsequent iterations of the challenge can explore various aspects of ULS model development, such as prioritizing segmentation performance over inference speed, expanding evaluation on rare lesion types or including different imaging modalities.

Results

Table 2 shows the mean Dice and HD95 scores with standard deviation across the various model configurations for both the held-out training data and the challenge test set. We report the results for each lesion type from the fully annotated data. Additionally, for the test set, we include results for the lesion types that were not seen during fully supervised training (PSUP). Fig. 4 shows the long- and short-axis axial measurement error distributions for the different lesion types in the held-out training data as predicted by the semi-supervised model. On the test set, this model achieves an overall ChallengeScore of 0.729, consisting of a mean Dice score of 0.703 ± 0.240, a long-axis SMAPE of 11.2% ± 15.8%, a short-axis SMAPE of 12.0% ± 15.9%, and a consistency Dice score of 0.787 ± 0.252. Fig. 5 shows the measurement error distribution for the full test set, and for those lesion types which were or were not contained in the fully-annotated training data.
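To make the composition of the overall score concrete, the sketch below combines the four reported components into a single weighted sum. Only the 10% weight of the consistency score is stated in the text; the split of the remaining 90% into 80% Dice and 5% for each axis measurement, the `1 - SMAPE` conversion, and the bounded form of the error in `sape` are assumptions made for illustration, and the function names are hypothetical.

```python
def sape(pred_mm: float, ref_mm: float) -> float:
    """Symmetric absolute percentage error of one axis measurement, in [0, 1].

    Assumed form: |pred - ref| / (|pred| + |ref|). Other SMAPE variants
    divide by the mean of the two values instead.
    """
    denom = abs(pred_mm) + abs(ref_mm)
    return 0.0 if denom == 0 else abs(pred_mm - ref_mm) / denom


def challenge_score(
    dice: float,
    smape_long: float,
    smape_short: float,
    consistency: float,
    weights: tuple[float, float, float, float] = (0.80, 0.05, 0.05, 0.10),
) -> float:
    """Weighted sum of the components; errors enter as (1 - SMAPE).

    The 0.10 consistency weight comes from the text; the other three
    weights are assumptions for this sketch.
    """
    w_dice, w_long, w_short, w_scs = weights
    return (
        w_dice * dice
        + w_long * (1.0 - smape_long)
        + w_short * (1.0 - smape_short)
        + w_scs * consistency
    )


# Mean values reported for the baseline model on the test set.
score = challenge_score(dice=0.703, smape_long=0.112, smape_short=0.120, consistency=0.787)
print(f"{score:.4f}")
```

Under these assumed weights the rounded means give roughly 0.7295, close to the reported ChallengeScore of 0.729; the small gap may come from rounding or from aggregating per lesion rather than from the means.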

Figure

Fig. 1. Histograms depicting the long- and short-axis measurements in millimeters for various lesion types in the fully-annotated training data reveal notable trends. Kidney and colon lesions tend to be larger on average. Lymph nodes, pancreas, and colon lesions exhibit a greater disparity between their long- and short-axis sizes, indicating that these lesions are more often non-spherical.

Fig. 2. Examples of GrabCut pseudo-masks. From left to right: a kidney lesion, mediastinal lymph node, subcutaneous mass, and lung lesion. Note how GrabCut tends to oversegment (orange mask ■) into healthy tissues compared to the reference measurements (purple lines ■). Lung lesions are visualized using Window Level: −500 HU, Window Width: 1400 HU. Lesions outside the lungs with WL: 350 WW: 40.
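The window level and width quoted in the caption define which range of Hounsfield units is mapped to the display range. A minimal sketch of that mapping follows, using the lung setting from the caption (level −500 HU, width 1400 HU); the function name `apply_window` and the normalization to [0, 1] are illustrative choices, not taken from the paper.

```python
def apply_window(hu: float, level: float, width: float) -> float:
    """Map a Hounsfield unit value to [0, 1] for display.

    Values below (level - width / 2) clip to 0 and values above
    (level + width / 2) clip to 1; the range in between is scaled linearly.
    """
    if width <= 0:
        raise ValueError("window width must be positive")
    lower = level - width / 2.0
    upper = level + width / 2.0
    clipped = min(max(hu, lower), upper)
    return (clipped - lower) / (upper - lower)


# Lung window from the caption: the displayed range is [-1200, 200] HU.
for value in (-1200.0, -500.0, 200.0, 1000.0):
    print(value, apply_window(value, level=-500.0, width=1400.0))
```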

Fig. 3. Training pipeline for the semi-supervised baseline model. (a) In the first training iteration, a nnUnet is pretrained using the 2D GrabCut masks generated from the partially annotated data, and then fine-tuned on the fully annotated data. (b) In the second training iteration, a different nnUnet is pretrained using the predicted 3D pseudo-masks for the partially annotated data and then fine-tuned using the fully annotated data.

Fig. 4. Boxplots of the long- and short-axis measurement errors for the baseline model on the different lesion types in the held-out training data. SAPE = Symmetric Average Prediction Error.

Fig. 5. Boxplots of the long- and short-axis measurement errors for the baseline model on the test set. The fully-supervised types are lung, liver, kidney, colon, pancreas, bone lesions and lymph nodes. Partially-supervised lesion types are those included in the partially annotated data, e.g. adrenal, ovary, subcutaneous. SAPE = Symmetric Absolute Percentage Error.

Fig. 6. Ground truth (orange outline ■) and baseline model predictions (purple outline ■) on axial slices from the test set. The 3D Dice score for each lesion is included in the top-left corner. The lesions visualized are: (a) spleen lesion (b) lesion in the abdominal wall (c) adrenal lesion (d) abdominal lymph node (e) liver lesion (f) lung lesion (g) mediastinal lymph node (h) kidney lesion (i) pericardial lesion. Lung lesions are visualized using Window Level: −500 HU, Window Width: 1400 HU. Lesions outside the lungs with WL: 350 WW: 40.

Fig. A.7. Age, sex and scanner manufacturer characteristics of the novel training data and the test set. For 3 series of the Radboudumc-Bone dataset and 13 series of the Radboudumc-Pancreas dataset, the metadata could not be recovered.

Fig. A.8. Study date and scan spacing distributions for the series included in the two novel training datasets.

Fig. A.9. Plots of the Dice score vs the long- and short-axis measurement error for the baseline model on the different lesion types in the held-out training data. SAPE = Symmetric Absolute Percentage Error.

Fig. A.10. Plots of the Dice score vs the long- and short-axis measurement error for the baseline model on the test data, split on lesion types seen in the fully-annotated data versus those in the partially annotated data. SAPE = Symmetric Absolute Percentage Error.

Fig. A.11. Plots of the sorted pairwise difference in Dice score between the up-scaled residual encoder nnUnet trained with semi-supervision and the self-configured nnUnet. The left graphs contain the lesion types in the test set covered by the fully-annotated training data; the right graphs contain the scores for the lesion types only present in the partially-annotated data. A negative score indicates the segmentation performance of the regular nnUnet was better for that case; a positive score indicates the semi-supervised nnUnet scored higher for that case. The black vertical line indicates 50% of the lesions; the orange line denotes where the score changed from negative to positive. TTA = Test Time Augmentations.

Fig. A.12. Plots of the sorted pairwise difference in Dice score between the up-scaled residual encoder nnUnet trained with and without semi-supervision. The left graphs contain the lesion types in the test set covered by the fully-annotated training data; the right graphs contain the scores for the lesion types only present in the partially-annotated data. A negative score indicates the segmentation performance of the fully-supervised nnUnet was better for that case; a positive score indicates the semi-supervised nnUnet scored higher for that case. The black vertical line indicates 50% of the lesions; the orange line denotes where the score changed from negative to positive. TTA = Test Time Augmentations.

Table

Table 1. Overview of the data used in the ULS23 challenge. The LNDb data licence does not allow repackaging their data, so it is not released as part of the training archive (de Grauw et al., 2023a). Instead, we release the code for participants to prepare the lesion VOIs for this dataset themselves (de Grauw et al., 2023b).

Table 2. Segmentation performance comparison on the 10% held-out training data per lesion type and the test set. Best results per category are highlighted in bold. HD95 represents the 95th percentile of the Hausdorff distance, measured in millimeters. For the individual lesion types in the test set, * indicates there were ≤ 20 lesions of this type in the test set. The exact distribution of lesion types is not provided to participants. FSUP = fully supervised lesion types (i.e. Kidney - Colon). PSUP = lesion types present in the partially supervised training data.
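As a rough illustration of the HD95 metric named in the caption, the sketch below computes a 95th-percentile Hausdorff distance between two sets of surface voxels in pure Python. It is not the evaluation code behind the table: pooling the distances of both directions before taking the percentile, the nearest-rank percentile, and the function names are all assumptions, and other implementations take the maximum of the two directed percentiles or interpolate.

```python
import math

Voxel = tuple[int, int, int]
Spacing = tuple[float, float, float]


def to_mm(voxels: list[Voxel], spacing: Spacing) -> list[tuple[float, ...]]:
    """Convert voxel indices to physical coordinates in millimeters."""
    return [tuple(c * s for c, s in zip(v, spacing)) for v in voxels]


def percentile(values: list[float], q: float) -> float:
    """Nearest-rank percentile (an assumed convention, not the only one)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


def hd95(surface_a: list[Voxel], surface_b: list[Voxel],
         spacing: Spacing = (1.0, 1.0, 1.0)) -> float:
    """95th percentile of nearest-neighbour surface distances, both directions pooled."""
    if not surface_a or not surface_b:
        raise ValueError("both surfaces must contain at least one voxel")
    a_mm, b_mm = to_mm(surface_a, spacing), to_mm(surface_b, spacing)
    a_to_b = [min(math.dist(p, q) for q in b_mm) for p in a_mm]
    b_to_a = [min(math.dist(q, p) for p in a_mm) for q in b_mm]
    return percentile(a_to_b + b_to_a, 95.0)
```

The brute-force nearest-neighbour search is quadratic in the number of surface voxels, which is acceptable for a sketch on small lesion crops but would normally be replaced by a distance transform.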

Table A.3. Segmentation performance comparison with test time augmentation on the 10% held-out training data per lesion type and the test set. Best results per category are highlighted in bold. HD95 represents the 95th percentile of the Hausdorff distance, measured in millimeters. For the individual lesion types in the test set, * indicates there were ≤ 20 lesions of this type in the test set. The exact distribution of lesion types is not provided to participants. FSUP = fully supervised lesion types (i.e. Kidney - Colon). PSUP = lesion types present in the partially supervised training data.

Table A.4. Hyperparameters of the baseline model (nnUnet-ResEnc+SS) and the intensity properties used for data normalization, calculated from the pretraining data.

Table A.5. List of the patient IDs per dataset in the held-out training data.