Incorporating spatial information in deep learning parameter estimation, with application to the intravoxel incoherent motion model in diffusion-weighted MRI | Literature Digest: Advances in AI for Medical Imaging

Title


Incorporating spatial information in deep learning parameter estimation with application to the intravoxel incoherent motion model in diffusion-weighted MRI


01

Literature Digest Introduction

Quantitative magnetic resonance imaging (MRI) can provide parametric information on biophysical tissue types and microstructural processes (Novikov et al., 2019). Multi-component signal analysis with biophysical models can provide such latent parametric information, but a challenge lies in accurately estimating the model parameters from the acquired data, as this involves solving a complex inverse problem. Conventional fitting approaches, such as least squares (LSQ), treat voxels as mutually independent, which makes them susceptible to noise (Kaandorp et al., 2023). In contrast, the tissue microenvironment is typically homogeneous in a local environment, and properties such as diffusion and perfusion do not vary randomly between adjacent voxels. There are therefore considerable spatial correlations between neighboring voxels. By exploiting these correlations, more accurate and robust model fitting may be achieved.

Conventional fitting approaches (Kayal et al., 2017; Lin et al., 2017) have shown that spatial homogeneity can be promoted by adding regularization terms that penalize differences between neighboring voxels. However, these approaches are entirely indiscriminate: they do not take local structure into account and may lack sensitivity to true underlying heterogeneity. As a result, they can smooth locally without preserving edges. To integrate spatial information by learning relationships from spatial patterns, several deep neural network (DNN) approaches have been proposed. In computer vision, the best known is perhaps the convolutional neural network (CNN) (Saxena, 2022). A basic CNN stacks multiple convolutional layers with a specific kernel size, which introduces a local inductive bias toward texture recognition. CNNs have previously been applied in several areas of diffusion MRI, such as cancer detection (Yoo et al., 2019) and segmentation (Chen et al., 2017), as well as parameter estimation (Gibbons et al., 2019; Huang, 2022; Ottens et al., 2022; Vasylechko et al., 2021). Other DNN architectures proposed to capture spatial information are attention-based models, such as the transformer (Vaswani et al., 2017). Transformer models are based on the attention mechanism, which enables a network to focus on specific aspects of a sequential input. Transformers have become the state of the art in natural language processing and sequence modeling, and have gained wide popularity in computer vision tasks (Khan et al., 2022). For parameter estimation, Karimi and Gholipour (2022) applied transformers to diffusion tensor imaging (DTI). By exploiting spatial correlations between the diffusion signals and tensor values in neighboring voxels, the transformer demonstrated performance superior to CNNs. Those networks were trained using reconstructed high-quality DTI as ground truth for upsampling low-quality data.

Incorporating spatial homogeneity in the supervised training of DNNs is challenging for multi-component signal analysis. This is mainly due to the lack of reliable ground truth for training, since the task involves an inverse problem. Conventional fitting approaches, such as least squares (LSQ), are highly sensitive to noise, which makes them an unreliable choice for generating ground-truth parameter values. Furthermore, high-quality data can be difficult to acquire, and noise may still affect the analysis of the acquired data. In contrast, simulated training data allow any plausible MRI signal to be generated from a representative ground truth, and correlations between neighboring elements can be introduced synthetically in a realistic fashion. Based on these observations, we hypothesized that attention models can learn to incorporate spatial information by training on synthetic data possessing correlations between neighboring data, and thereby outperform alternative model fitting approaches. We provide a demonstration with the intravoxel incoherent motion (IVIM) model in diffusion-weighted imaging (DWI). We compare several DNN architectures, including convolution-based and attention-based networks, for self-supervised and supervised learning. In simulations, we quantitatively probe the potential performance gains obtained by training on larger neighborhoods, and explore the possibility of improving our methods by incorporating additional prior information about the corresponding test data into training. Furthermore, we investigate the performance of these methods on in vivo data from a healthy volunteer.
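For reference, the IVIM model referred to throughout is commonly written as the bi-exponential S(b) = S0 (f·e^(−b·D*) + (1 − f)·e^(−b·D)), where f is the perfusion fraction, D* the pseudo-diffusion coefficient, and D the tissue diffusion coefficient. The following minimal sketch simulates such signals; the b-values, parameter values, and Gaussian noise model are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def ivim_signal(b, S0, f, D_star, D):
    """Standard bi-exponential IVIM model:
    S(b) = S0 * (f * exp(-b * D*) + (1 - f) * exp(-b * D))."""
    return S0 * (f * np.exp(-b * D_star) + (1 - f) * np.exp(-b * D))

# Example: simulate one voxel at typical b-values (s/mm^2);
# D and D* are in mm^2/s, and the noise level corresponds to SNR ~ 200.
b_values = np.array([0, 10, 20, 60, 150, 300, 500, 1000])
signal = ivim_signal(b_values, S0=1.0, f=0.1, D_star=0.05, D=0.001)
noisy = signal + np.random.normal(0.0, 1.0 / 200, size=signal.shape)
```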

Abstract


In medical image analysis, the utilization of biophysical models for signal analysis offers valuable insights into the underlying tissue types and microstructural processes. In diffusion-weighted magnetic resonance imaging (DWI), a major challenge lies in accurately estimating model parameters from the acquired data due to the inherently low signal-to-noise ratio (SNR) of the signal measurements and the complexity of solving the ill-posed inverse problem. Conventional model fitting approaches treat individual voxels as independent. However, the tissue microenvironment is typically homogeneous in a local environment, where neighboring voxels may contain correlated information. To harness the potential benefits of exploiting correlations among signals in adjacent voxels, this study introduces a novel approach to deep learning parameter estimation that effectively incorporates relevant spatial information. This is achieved by training neural networks on patches of synthetic data encompassing plausible combinations of direct correlations between neighboring voxels. We evaluated the approach on the intravoxel incoherent motion (IVIM) model in DWI. We explored the potential of several deep learning architectures to incorporate spatial information using self-supervised and supervised learning. We assessed performance quantitatively using novel fractal-noise-based synthetic data, which provide ground truths possessing spatial correlations. Additionally, we present results of the approach applied to in vivo DWI data consisting of twelve repetitions from a healthy volunteer. We demonstrate that supervised training on larger patch sizes using attention models leads to substantial performance improvements over both conventional voxelwise model fitting and convolution-based approaches.


Background


2.1. Learning strategies

In deep learning, there exist several types of learning strategies in terms of the level of supervision. In this work, we focus on self-supervised and supervised learning. In self-supervised learning (also commonly referred to as unsupervised) for DWI parameter estimation, networks are trained using the measured (noisy) DWI signals as ground truth, denoted as $S(b)$. The network's input typically consists of these DWI signals. The network's output typically comprises the parameter estimates for the biophysical model ($\hat{\theta}_{net}$), which are subsequently used to generate model-predicted DWI signals, denoted as $S_{net}(b)$. The network is optimized using a loss function, typically the mean-squared error (MSE), between the input measured signal $S(b)$ and the model-predicted signal $S_{net}(b)$, denoted in this manuscript as "signals-MSE":

$$\text{signals-MSE} = \mathcal{L}(S(b), S_{net}(b)) = \sum_{b \in B} \left\| S(b) - S_{net}(b) \right\|^2. \tag{1}$$

For supervised learning, networks are trained instead using predefined model parameters, $\theta$, as ground truth, and using the biophysical model to generate synthetic (noisy) DWI signals as the input. The loss function is typically equal to the MSE between the network's output parameters, $\hat{\theta}_{net}$, and the normalized ground-truth parameters, $\theta$ (scaled between 0 and 1), denoted in this manuscript as "parameters-MSE":

$$\text{parameters-MSE} = \mathcal{L}(\theta, \hat{\theta}_{net}) = \sum_{\text{parameters}} \left\| \theta - \hat{\theta}_{net} \right\|^2. \tag{2}$$

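A minimal PyTorch-style sketch may help make the distinction between the two losses concrete. Eqs. (1) and (2) only specify sums over b-values and parameters, so the averaging over the batch dimension here is an assumption:

```python
import torch

def signals_mse(S_meas: torch.Tensor, S_net: torch.Tensor) -> torch.Tensor:
    """Self-supervised loss, Eq. (1): squared error between measured DWI
    signals S(b) and model-predicted signals S_net(b), summed over the
    b-values (last axis), then averaged over the batch (an assumption)."""
    return ((S_meas - S_net) ** 2).sum(dim=-1).mean()

def parameters_mse(theta_gt: torch.Tensor, theta_net: torch.Tensor) -> torch.Tensor:
    """Supervised loss, Eq. (2): squared error between the network's output
    parameters and the ground-truth parameters, both normalized to [0, 1],
    summed over the parameters (last axis)."""
    return ((theta_gt - theta_net) ** 2).sum(dim=-1).mean()
```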

Method


This section describes four sub-studies, summarized as follows:

1. Exploration of network architectures and learning strategies: We introduce our synthetic training data that share neighborhood information and explore different network architectures for the potential of incorporating spatial information with a receptive field of 3×3, including convolution-based networks and attention-based networks trained either self-supervised or supervised. The synthetic data consist of 3×3 patches where parameter values are sampled from a uniform distribution, and where the training encompasses all possible combinations of neighboring correlations (a sketch of this patch generation is given after this list).
2. Exploration of training with larger receptive fields and different attention concepts: We explore training on synthetic data with larger receptive fields, and the different concepts of self-attention and neighborhood-attention for this purpose.
3. Exploration of the representativeness of different training data: We explore whether estimator performance may be improved by incorporating specific features of the test set into the synthetic training data, such as similar spatial variation and/or a (non-uniform) distribution of parameter values.
4. In vivo analysis: We investigate the performance of our methods on an in vivo dataset containing multiple repetitions of acquired data.

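As referenced in sub-study 1, the following sketch illustrates one plausible way to generate neighbors-random-style 3×3 parameter patches with two unique values per IVIM parameter; the sampling bounds are illustrative assumptions, not the paper's values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative uniform sampling bounds for (f, D*, D); the paper's exact
# ranges are not given here.
PARAM_BOUNDS = {"f": (0.0, 0.5), "D_star": (0.005, 0.1), "D": (0.0005, 0.003)}

def make_neighbors_random_patch():
    """Build one 3x3 patch of IVIM parameters with two unique values:
    a random subset of the 8 neighbors is 'correlated' (copies the
    center pixel's parameters); the rest take a second random value."""
    neighbors = [(0, 0), (0, 1), (0, 2), (1, 0), (1, 2), (2, 0), (2, 1), (2, 2)]
    n_corr = rng.integers(0, 9)                       # 0..8 correlated neighbors
    corr_idx = rng.choice(8, size=n_corr, replace=False)
    patch = {}
    for name, (lo, hi) in PARAM_BOUNDS.items():
        center_val, other_val = rng.uniform(lo, hi, size=2)
        grid = np.full((3, 3), other_val)
        grid[1, 1] = center_val                       # center pixel
        for i in corr_idx:
            grid[neighbors[i]] = center_val           # correlated neighbors
        patch[name] = grid
    return patch
```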

Conclusion


This study presented a method for deep learning parameter estimation that effectively incorporates spatial information using attention models, with application to the intravoxel incoherent motion (IVIM) model in diffusion-weighted imaging (DWI). By training on patches of synthetic data possessing spatial correlations, the networks learned to identify relevant neighboring voxels and leverage the correlated information to improve estimation accuracy and reduce inherent bias. Attention models were found to be superior to convolution-based architectures for this purpose in terms of convergence speed and accuracy. The improvements were found to be exclusive to supervised learning, with no observed benefit for self-supervised learning. Quantitative evaluation using synthetic data derived from novel fractal-based IVIM parameter maps with inherent spatial correlations showed that training with a larger receptive field further enhances performance, while preserving edge-like structures. Neighborhood-attention was therefore preferred to self-attention, since the latter was limited by excessive memory demands for larger receptive fields. Results for in vivo data were in broad agreement with the simulations. We demonstrated the potential for enhancing estimator accuracy further by including other prior assumptions about the test data in the network's training process, and exploring these concepts further for in vivo applications should be the topic of future work.


Results


4.1. Exploration of network architectures and learning strategies

Fig. 3 shows the loss curves over the first 10,000 epochs for the six networks, consisting of a voxelwise network, a convolution-neighborhood network, and a transformer-neighborhood network that were trained either self-supervised or supervised on the neighbors-random dataset of 3×3 patches. For self-supervised learning, all three networks demonstrated similar performance, achieving equal loss values. The transformer network exhibited the least spiky and fastest convergence. For supervised learning, the transformer-neighborhood network outperformed both the convolution-neighborhood network and the voxelwise network, showing lower loss values. Notably, the transformer-neighborhood network displayed considerably faster convergence than the convolution-neighborhood network and reached a lower final loss. Key to obtaining improved performance for the convolution-neighborhood network was substantially longer training (see the dip at epoch 6000). Even after 50,000 epochs, the convolution-neighborhood network was still slowly decreasing in loss but remained higher than the transformer-neighborhood network (see also the rightmost column of Fig. 4). In contrast, the transformer-neighborhood network had already achieved convergence after only 3000 epochs of training.

Fig. 4 displays the final loss values of the networks that were trained supervised on the neighbors-random dataset of 3×3 patches, when applied to the test subsets, each comprising a distinct number of correlated neighbors (neighbors-N), i.e. those sharing identical underlying IVIM parameters. The transformer-neighborhood network exhibited superior performance to the convolution-neighborhood network, evidenced by achieving lower loss values for individual patches and by matching or surpassing the baseline optima. The neighborhood networks (both CNN and transformer) demonstrated enhanced performance when applied to patches possessing a higher number of correlated neighbors. Note that the convolution-neighborhood network trained on neighbors-random demonstrates a lower loss in some instances than the convolution-neighborhood networks trained on neighbors-N (baseline optima), which is indicative of the variability in the training of the network and dependence on the chosen stopping point (evidenced by the spiky convergence in Fig. 3). As expected, the voxelwise network demonstrated the same loss regardless of the number of correlated neighbors.

In Fig. 5, we present scatter plots comparing the estimates with the ground truth values for the six networks, as described above, evaluated on the neighbors-all test set. These scatter plots visually illustrate that the self-supervised networks exhibit similar performance, regardless of the capacity of the network to incorporate spatial information. The self-supervised parameter estimates possess a distribution tending toward that of LSQ fitting. In contrast, for supervised learning, the transformer-neighborhood network and convolution-neighborhood network outperformed the voxelwise network. They exhibit a substantial decrease in the inherent bias associated with supervised learning (e.g. a less dense horizontal band of dark points). Moreover, they display a narrowing of the spread of values along the diagonal (exemplified by the thinner light-yellow band), indicating improved estimator accuracy. Indeed, a similar narrowing is observed for LSQ fitting when it is applied to the mean of the 3×3 neighborhood (top right), and hence to higher-SNR (signal-averaged) data, rather than to the voxelwise data (top left).

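For reference, the voxelwise least-squares (LSQ) baseline used throughout these comparisons can be sketched as follows; the starting values and bounds are illustrative assumptions that keep the parameters plausible and separate D* from D:

```python
import numpy as np
from scipy.optimize import curve_fit

def ivim(b, S0, f, D_star, D):
    # Bi-exponential IVIM signal model.
    return S0 * (f * np.exp(-b * D_star) + (1 - f) * np.exp(-b * D))

def lsq_fit(b_values, signals):
    """Voxelwise least-squares IVIM fit; returns (S0, f, D*, D)."""
    p0 = [1.0, 0.1, 0.05, 0.001]
    bounds = ([0.0, 0.0, 0.005, 0.0], [2.0, 1.0, 0.3, 0.005])
    popt, _ = curve_fit(ivim, b_values, signals, p0=p0, bounds=bounds)
    return popt

# Fitting the mean signal of a 3x3 neighborhood instead of a single voxel
# (cf. Fig. 5, top right) raises the effective SNR before fitting:
#   averaged = patch.reshape(-1, len(b_values)).mean(axis=0)
#   lsq_fit(b_values, averaged)
```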

Figure

Fig. 1. Examples of the synthetic datasets of sub-study 1. (A) Example 3×3 neighborhoods of the neighbors-random test set, where patches were generated with a varying number of neighbors that are correlated (i.e. share identical underlying IVIM parameters) with the center pixel. Each patch in the neighbors-random dataset consisted of two unique parameter values that were drawn from a uniform distribution for each IVIM parameter. (B) Example 3×3 neighborhoods of the neighbors-N subsets, with 'N' denoting the number of neighboring pixels (ranging from 0 to 8) correlated with the center pixel. (C) Example 3×3 neighborhoods of the neighbors-all test set, where all neighbors in the patch are correlated (i.e. share identical underlying IVIM parameters).


Fig. 2. (A) A schematic of how to generate a synthetic fractal-based IVIM parameter map for the brain (example of D). The fractal-based parameter maps contain spatially-correlated ground truths, enabling a quantitative evaluation of our proposed method. One fractal-based map is used to provide masks representing the different tissue types (CSF = cerebrospinal fluid, WM = white matter, GM = gray matter). Three other fractal-based maps are scaled to appropriate values for each tissue type. Then, the masks for each tissue type are applied to the corresponding scaled fractal-based map, wherefrom the composite fractal-based IVIM parameter map is generated. Finally (not shown in this figure), IVIM signals are generated from the fractal-based IVIM parameter maps using Eq. (4). The square boxes within the masks denote the patches utilized to generate the mask-based patches in (B). (B) A schematic of how to generate synthetic training data in patches (3 examples of D for 13×13 patches) that possess a varying degree of representativeness for the IVIM fractal-based test set, as described in Section 3.3.1. The 'triangle' symbol depicts the merging of a distribution (uniform or Gaussian) with the spatial orientation of the pixels (binary patch or masks derived from fractal-based maps). The different datasets are denoted by 'patch type -- distribution type'. Here, the patch type is either 'random' or 'structured', and the distribution type is either 'uniform' or 'gaussian'. The dotted lines in the distributions represent the distribution means of each relevant tissue type. For visualization purposes, the mask-based patches are derived from the same masks used to create the D* fractal-based map from the test set. However, in our data generation process, these mask-based training datasets feature patches sampled from different fractal-based maps, yet with the same random generation statistics. As a baseline, we also consider 'test-based' patches derived in an identical fashion to the fractal-based test set.

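A minimal sketch of the compositing step described in (A): each tissue mask selects voxels from a fractal-noise map rescaled to that tissue's value range. The value ranges and the mask/fractal inputs here are placeholders, not the paper's values:

```python
import numpy as np

def composite_parameter_map(tissue_masks, fractal_maps, value_ranges):
    """Composite a fractal-based parameter map: for each tissue type,
    rescale a fractal-noise map (values in [0, 1]) to that tissue's
    value range, then paste it in through the tissue mask.

    tissue_masks : dict of boolean arrays, e.g. {'WM': ..., 'GM': ..., 'CSF': ...}
    fractal_maps : dict of float arrays in [0, 1], one per tissue type
    value_ranges : dict of (lo, hi) per tissue type (illustrative)
    """
    shape = next(iter(tissue_masks.values())).shape
    out = np.zeros(shape)
    for tissue, mask in tissue_masks.items():
        lo, hi = value_ranges[tissue]
        scaled = lo + (hi - lo) * fractal_maps[tissue]  # rescale to tissue range
        out[mask] = scaled[mask]                        # paste through the mask
    return out
```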

Fig. 3. Test curves showing the metrics signals-MSE and parameters-MSE for the three networks (voxelwise network, convolution-neighborhood network, and transformer-neighborhood network) over 10,000 epochs. The networks were trained self-supervised (optimized on signals-MSE) or supervised (optimized on parameters-MSE) on neighbors-random data. For each test curve, the average over every 100 epochs is also displayed in a lighter color, represented by the middle line of each bar in the legend. The plots are displayed with transparency (alpha = 0.7) to show the overlaid data. The figure shows that incorporating spatial information in supervised training improved the final parameters-MSE loss (right). However, this improvement was not observed for self-supervised learning, which is optimized on signals-MSE (left). Moreover, the transformer-neighborhood network converged considerably faster and possessed a lower final loss than the convolution-neighborhood network (see Fig. 4 for the final losses at 50,000 epochs for neighbors-random). For the self-supervised networks, prolonged training resulted in increased parameters-MSE, as was also demonstrated in Kaandorp et al. (2023). Therefore, the self-supervised networks are only visible at early epochs (epochs 100-400) in the parameters-MSE plot (e.g. the light blue curve).


Fig. 4. Performance of the supervised networks trained on neighbors-random for a 3×3 neighborhood, evaluated for each number of neighbors individually. The convolution-neighborhood networks trained specifically for each test subset containing a specific number of correlated neighbors (neighbors-N, yellow) represent the proxies for the optimal expected performance. The green dotted line represents the outcomes for the voxelwise network (fully-connected MLP), with consistent loss across all test subsets. In certain evaluations, it can be seen that the convolution-neighborhood network trained on neighbors-random outperformed the network trained on neighbors-N due to potential instability in training, leading to occasional spikes in the loss. The figure shows that the transformer-neighborhood network provides comparable or better performance than the alternatives.


Fig. 5. Scatter plots comparing estimated parameter values against ground truth values for the networks described in Fig. 3, tested on the neighbors-all test set consisting of 100,000 3×3 patches of DWI signals. Corresponding plots are also shown for voxelwise least squares applied to the center pixel of each patch (top left), and applied to the signal average of each 3×3 neighborhood (top right). All data points are colored by their S0-value, where S0 = 0 (black) corresponds to SNR = 0 and S0 = 1 (bright yellow) corresponds to SNR = 200.


Fig. 6. Example fractal-based IVIM parameter maps, RMSE maps, and error maps estimated by the transformer-neighborhood networks trained supervised with different receptive fields on random-uniform data (see Section 3.2.2). The networks were trained either with self-attention (Self-X, where X denotes the receptive field) or neighborhood-attention (NATTEN-X). Note that transformer-neighborhood network 'Self-1' is essentially a voxelwise (MLP) equivalent. Corresponding maps of the ground truth (top left), least squares, and segmented fit are also shown. Overall, the figure shows improved fitting for networks trained with larger receptive fields.

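The "segmented fit" shown as a reference method is, in common IVIM practice, a two-step procedure: D is first estimated from the high-b portion of the signal (where the perfusion compartment is assumed to have decayed), after which the remaining parameters are fitted with D fixed. A sketch assuming this conventional formulation follows; the b-value threshold is an assumption, not taken from the paper:

```python
import numpy as np
from scipy.optimize import curve_fit

def segmented_fit(b, S, b_thresh=200.0):
    """Conventional two-step (segmented) IVIM fit.
    Step 1: log-linear fit of S ~ exp(-b * D) on b >= b_thresh.
    Step 2: fix D, then fit S0, f and D* on all b-values."""
    hi = b >= b_thresh
    slope, intercept = np.polyfit(b[hi], np.log(S[hi]), 1)
    D = -slope

    def model(b, S0, f, D_star):
        return S0 * (f * np.exp(-b * D_star) + (1 - f) * np.exp(-b * D))

    p0 = [np.exp(intercept), 0.1, 0.05]
    popt, _ = curve_fit(model, b, S, p0=p0,
                        bounds=([0.0, 0.0, 0.005], [2.0, 1.0, 0.3]))
    S0, f, D_star = popt
    return S0, f, D, D_star
```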

Fig. 7. Scatter plots comparing estimated parameter values against ground truth values for the transformer-neighborhood networks trained supervised with different receptive fields on random-uniform data (see Section 3.2.2), and tested on the entire IVIM fractal-based test set (40 fractal-based IVIM parameter maps). The networks were trained either with self-attention (Self-X, where X denotes the receptive field) or neighborhood-attention (NATTEN-X). Corresponding plots of the least squares and segmented fit are also shown. All data points are colored by their S0-value, where S0 = 0 (black) corresponds to SNR = 0 and S0 = 1 (bright yellow) corresponds to SNR = 200. The arrows in the least squares plot for D indicate the different tissue types (dark purple = WM, light purple = GM, and orange = CSF). The figure shows that when training involves larger receptive fields, the fitting is generally improved.


Fig. 8. Boxplots displaying the median absolute percentage error (MDAPE), median percentage error (MDPE), and mean absolute percentage error (MAPE) for each IVIM parameter, calculated for the transformer-neighborhood networks trained supervised with different receptive fields on random-uniform data (see Section 3.2.2). These error metrics were calculated on the entire IVIM fractal-based test set (40 fractal-noise IVIM parameter maps). To provide insight into network stability, these error metrics were computed at each of the last 20 epochs of the entire training. The networks were trained either with self-attention (Self-X, where X denotes the receptive field) or neighborhood-attention (NATTEN-X). Note that transformer-neighborhood network 'Self-1' is essentially a voxelwise equivalent. The values printed above the plots (blue) are the medians of the boxplots for Self-1 that could not be displayed within the chosen range of the plot.

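The three error metrics can be computed as follows; a short sketch assuming the estimates and ground truths are given as flat arrays of (strictly positive) parameter values:

```python
import numpy as np

def percentage_errors(est, gt):
    """MDAPE, MDPE and MAPE (in %) between estimated and ground-truth
    parameter values, computed over all voxels."""
    pe = 100.0 * (est - gt) / gt          # signed percentage error
    ape = np.abs(pe)                      # absolute percentage error
    return {
        "MDAPE": np.median(ape),          # median absolute percentage error
        "MDPE": np.median(pe),            # median percentage error (bias)
        "MAPE": np.mean(ape),             # mean absolute percentage error
    }
```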

Fig. 9. Boxplots displaying the median absolute percentage error (MDAPE), median percentage error (MDPE), and mean absolute percentage error (MAPE) for each IVIM parameter, calculated on the entire IVIM fractal-based test set (40 fractal-based IVIM parameter maps). To provide insight into network stability, these error metrics were computed at each of the last 20 epochs of the entire training. The metrics were calculated for transformer-neighborhood networks that were trained supervised using different synthetic data, ranging from random-uniform (trained independently of the test data) to data that is more representative of the test data (see Section 3.3.1). The latter involves leveraging information on the underlying spatial variation ('structured-') and/or the underlying distribution of the parameter values ('-gaussian') from the entire IVIM fractal-based test set. The networks were trained with either self-attention (Self-3) or neighborhood-attention (NATTEN-7, NATTEN-17). As a baseline comparison, we also included results for networks trained on synthetic data that had been generated in the same way as the test set ('test-based').


Fig. 10. IVIM parameter maps and RMSE maps for a representative slice from the in vivo volunteer data, for one of the twelve repetitions, estimated by the transformer-neighborhood networks trained supervised with different receptive fields. These networks were trained using either a uniform distribution for each parameter (random-uniform; left) or a distribution that is more representative of the in vivo data (random-gaussian; right). The networks were trained with either self-attention (Self-1, Self-3) or neighborhood-attention (NATTEN-7, NATTEN-17). Also shown is the b = 0 image (top left), along with corresponding maps for the least squares and segmented fit, both on the same single repetition of the in vivo data as well as on the signal average of all twelve repetitions. The purple arrows indicate regions of lower f in the white matter, while the red arrows indicate regions of higher f, for comparison with the results of the NATTEN-17 networks.


Fig. 11. Violin plots (with box-and-whisker plots overlaid) of the coefficient of variation (CV) for the transformer-neighborhood networks that were trained supervised with different receptive fields and applied to the in vivo data. These networks were trained using either a uniform distribution for each parameter (random-uniform; red) or a distribution that is more representative of the in vivo data (random-gaussian; blue). The networks were trained with either self-attention (Self-X, where X denotes the receptive field) or neighborhood-attention (NATTEN-X). For each IVIM parameter, we computed the CV for each voxel across all repetitions and pooled the results for both white matter (left) and gray matter (right) ROIs. Corresponding plots are also shown for least squares and the segmented approach (gray). The outliers of the boxplots (i.e., data exceeding 1.5 times the interquartile range above the upper quartile or below the lower quartile) are not displayed, for visualization purposes.

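The coefficient of variation used here is, per its usual definition, the per-voxel standard deviation over repetitions divided by the per-voxel mean; a minimal sketch follows (the use of the sample standard deviation is an assumption):

```python
import numpy as np

def coefficient_of_variation(estimates, roi_mask):
    """Per-voxel CV across repetitions, pooled within an ROI.

    estimates : array of shape (n_repetitions, ...) of parameter maps
    roi_mask  : boolean array selecting e.g. white- or gray-matter voxels
    Returns the pooled CV values (std / mean over the repetition axis)."""
    mean = estimates.mean(axis=0)
    std = estimates.std(axis=0, ddof=1)   # sample std (assumption)
    cv = std / mean
    return cv[roi_mask]
```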

Fig. 12. Direct complement to Fig. 11, showing violin plots of the parameter estimates themselves rather than the coefficients of variation (CV).

