[AI Paper Digest] Spatial-Aware Regression for Keypoint Localization

Regression-based keypoint localization is more efficient and more robust to quantization error than heatmap-based methods. However, existing regression-based methods discard the spatial location prior in the input image through global pooling, which leads to inferior accuracy and limits them to single-instance localization tasks. We study regression-based keypoint localization from a new perspective by leveraging the spatial location prior. Instead of regressing on the pooled feature, the proposed Spatial-Aware Regression (SAR) maintains the spatial location map and outputs spatial coordinates and a confidence score for each grid, which are optimized with a unified objective. Thanks to the location prior, these spatial-aware outputs can be optimized efficiently, resulting in better localization performance. Moreover, incorporating the spatial prior makes SAR more general, so it can be applied to various keypoint localization tasks. We test the proposed method on 4 keypoint localization tasks, including single/multi-person 2D/3D pose estimation and whole-body pose estimation. Extensive experiments demonstrate its promising performance, e.g., it consistently outperforms recent regression-based methods.

Introduction

Keypoint localization aims to locate target keypoints in an input image and is a fundamental task in computer vision. It has a wide range of applications, including human pose estimation [21, 26-28] and facial landmark detection [19]. Existing methods fall into two categories: heatmap-based [21, 29, 31] and regression-based [10, 23, 25]. Regression-based methods directly use a neural network to learn the mapping from the input RGB image to keypoint coordinates. Heatmap-based methods use a probability map (also referred to as a heatmap) to encode the likelihood of the target location and retrieve it by selecting the location with the highest probability.

Project code: https://github.com/kennethwdk/SAR

Figure 1. Illustration of (Top) regression-based method, (Middle) standard heatmap-based method, and (Bottom) the proposed SAR for keypoint localization.
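The decoding step of a heatmap-based method can be sketched in a few lines. The example below is an illustration only, not code from the paper or its repository: the helper names (`gaussian_heatmap`, `decode_argmax`, `quantization_error`), the 256-pixel image size, and the Gaussian target are all assumptions made for the sketch. It renders a heatmap at several output strides and measures how far the highest-probability cell lands from the true keypoint.

```python
import math

def gaussian_heatmap(width, height, cx, cy, sigma=1.0):
    """Render a Gaussian heatmap peaked at (cx, cy), given in heatmap cells."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]

def decode_argmax(heatmap, stride):
    """Map the highest-scoring cell back to image pixels."""
    cells = ((y, x) for y in range(len(heatmap)) for x in range(len(heatmap[0])))
    best_y, best_x = max(cells, key=lambda yx: heatmap[yx[0]][yx[1]])
    return best_x * stride, best_y * stride

def quantization_error(true_xy, stride, image_size=256):
    """Distance in pixels between the true keypoint and the argmax decoding."""
    size = image_size // stride
    tx, ty = true_xy
    heatmap = gaussian_heatmap(size, size, tx / stride, ty / stride)
    px, py = decode_argmax(heatmap, stride)
    return math.hypot(px - tx, py - ty)

# Hypothetical keypoint at (101, 59) in a 256x256 image.
for stride in (4, 8, 16):
    err = quantization_error((101.0, 59.0), stride)
    print(f"stride={stride:2d}  error={err:.2f}px")
```

Because the decoded location is always a cell position multiplied by the stride, the error per axis can be as large as half the stride, which is why a coarser heatmap gives a coarser localization.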

As illustrated in Fig. 1, heatmap-based methods select pre-defined points on the heatmap as localization results, which makes them easy to optimize. However, a low-resolution heatmap leads to high quantization error, while a high-resolution heatmap increases computation and storage cost. Regression-based methods are more efficient and robust to quantization error, but they are hard to optimize and commonly achieve inferior performance. One reason is that conventional regression destroys the spatial location information of the feature map through global pooling, and thus cannot provide a good initialization for the subsequent regression. This design also prevents regression from differentiating and locating multiple keypoints of the same type, as required in, e.g., multi-person pose estimation.
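The effect of global pooling on location information can be seen in a toy case. The sketch below is a deliberate simplification and not the paper's network: it uses a single-channel 4x4 feature map and a hypothetical `global_average_pool` helper, whereas real backbones have many channels whose receptive fields may retain some positional cues.

```python
def global_average_pool(feature_map):
    """Average a single-channel 2D feature map into one scalar."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)

# Two 4x4 maps with the same activation at opposite corners.
top_left = [[0.0] * 4 for _ in range(4)]
bottom_right = [[0.0] * 4 for _ in range(4)]
top_left[0][0] = 1.0
bottom_right[3][3] = 1.0

pooled_a = global_average_pool(top_left)
pooled_b = global_average_pool(bottom_right)

# Both maps pool to the same value, so a regressor that only sees the
# pooled feature cannot tell the two locations apart.
print(pooled_a == pooled_b)  # prints True
```

In this toy setting the pooled feature is identical for both inputs, so a head operating on it has no basis for distinguishing where the activation occurred, or for separating two instances of the same keypoint type.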

This work aims to facilitate regression-based localization by embedding the spatial location prior into regression. Grids in the extracted feature map provide different starting points for regression, making them suited to locating different keypoints. Regressing from different starting points also introduces duplicate predictions, and it is not known in advance which grids produce the best localization results. Prior works [5, 18] assume that the results of grids near the target location are accurate and select them via a separate classification branch. We argue that this heuristic design is not optimal in all cases, e.g., for occluded or truncated keypoints. Moreover, this multi-task learning pipeline introduces optimization inconsistency between classification and regression [2] and is sensitive to many hyperparameters.

This work presents Spatial-Aware Regression (SAR), a novel regression method that effectively utilizes the spatial location prior in the input image to generate spatial-aware outputs and automatically select the best prediction. As shown in Fig. 1, SAR regresses coordinates on each grid to utilize its spatial location cues. Thanks to this prior, SAR is able to leverage better starting points. Selecting different starting points also helps to differentiate similar keypoints, making SAR applicable to more challenging tasks such as multi-person pose estimation.

SAR performs localization on a set of grids, which can be extracted by deep neural networks such as a CNN [4] or a Vision Transformer [31]. To utilize the spatial location prior, SAR introduces a spatial-aware regressor that locates targets based on the spatial location of each grid. To handle duplicate predictions, we propose a spatial-aware selector that evaluates the quality of each regression as a confidence score and selects the best prediction. The selector is jointly optimized with the regressor under a unified objective, leading to automatic regression and selection without heuristic design or complex hyperparameters. The selector also suppresses the influence of inaccurate predictions. SAR works well on a low-resolution feature map, so it introduces only marginal computational overhead and maintains efficiency similar to existing regression-based methods.
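The inference-time idea of per-grid regression followed by selection can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the excerpt does not give SAR's exact output parameterization or its unified objective, so the sketch assumes each grid predicts an offset from its own center plus a confidence score, and the function names, the 2x2 grid, the stride of 32, and the hard-coded numbers standing in for network outputs are all hypothetical.

```python
def grid_centers(width, height, stride):
    """Image-space center of every grid cell, in row-major order."""
    return [((x + 0.5) * stride, (y + 0.5) * stride)
            for y in range(height) for x in range(width)]

def select_best(centers, offsets, confidences):
    """Turn per-grid offsets into coordinates and keep the most confident one."""
    candidates = [(cx + dx, cy + dy)
                  for (cx, cy), (dx, dy) in zip(centers, offsets)]
    best = max(range(len(candidates)), key=lambda i: confidences[i])
    return candidates[best], confidences[best]

# A 2x2 grid over a 64x64 image (stride 32): centers at 16 and 48 per axis.
centers = grid_centers(2, 2, 32)

# Made-up predictions: every grid regresses toward the same keypoint,
# and the confidence score ranks the duplicate predictions.
offsets = [(22.0, 5.0), (-8.5, 4.25), (20.0, -30.0), (-6.0, -25.0)]
confidences = [0.30, 0.90, 0.10, 0.20]

keypoint, score = select_best(centers, offsets, confidences)
print(keypoint, score)
```

Each grid starts from a different location, so its regression target is a local offset, and the confidence score decides which of the duplicate predictions is kept. For multiple instances, one would presumably keep several high-confidence grids rather than a single maximum, but that step is not described in this excerpt.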

SAR shares all the merits of conventional regression and surpasses it in many aspects. We test its effectiveness on various keypoint localization tasks, including single/multi-person 2D/3D pose estimation and whole-body pose estimation. Extensive experiments on 7 keypoint localization benchmarks demonstrate its superior performance. For example, SAR obtains 72.5% AP on the COCO Keypoint dataset [14], which is 16.5% and 1.8% higher than conventional regression and heatmap-based methods, respectively. SAR is robust to various input sizes and output strides, making it better suited to complex scenarios. SAR also generalizes well to various types of keypoints, an arbitrary number of keypoints, and both 2D and 3D keypoints.

Related Work

Heatmap-based Keypoint Localization encodes keypoint locations with a probability map, an idea introduced by [24]. Methods of this type estimate heatmaps and retrieve keypoint coordinates with a post-processing operation. Heatmap-based methods dominate the field of keypoint localization because heatmaps are easy to learn with CNNs. Pioneering works [16, 21, 29] design powerful CNN models to estimate high-resolution heatmaps for human pose estimation and facial landmark detection, after which the target keypoint can be obtained by a simple post-processing shift [16, 33]. Because feature map size is limited, some works [18] combine heatmaps with regression and add an offset branch to avoid quantization error. These methods improve the performance of heatmaps. However, they rely on high-resolution heatmaps to locate keypoints, which results in high computation and storage cost.

Regression-based Keypoint Localization directly learns the mapping from the input image to output coordinates via a neural network, an approach adopted by several classical methods [1, 25]. Researchers have proposed many methods to improve the performance of direct regression. The first kind changes the way regression is performed. Integral pose regression [23] leverages the soft-argmax operation to regress keypoint locations by integrating a latent heatmap, which has been shown to be superior to direct regression. Sampling-argmax [11] further improves soft-argmax by minimizing the error between the groundtruth and samples drawn from a distribution, avoiding the unconstrained probability map of the previous method. Other works improve regression by proposing new loss functions. RLE [10] replaces the predefined Gaussian or Laplace distribution in commonly used regression losses with a distribution learned via normalizing flow. Recently, researchers have also tried to improve direct regression with more powerful backbones based on the Transformer architecture [31, 32], such as TokenPose [12] and PETR [20].
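The soft-argmax operation mentioned above can be illustrated with a generic sketch. It is not the implementation from [23]: the function name `soft_argmax_2d` and the toy 4x4 logits are assumptions for the example. The idea is to normalize a latent heatmap with a softmax and take the expected cell index along each axis, which yields a continuous coordinate.

```python
import math

def soft_argmax_2d(logits):
    """Expected (x, y) cell index under a softmax over all heatmap cells."""
    peak = max(v for row in logits for v in row)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [[math.exp(v - peak) for v in row] for row in logits]
    total = sum(sum(row) for row in weights)
    x = sum(w * xi for row in weights for xi, w in enumerate(row)) / total
    y = sum(w * yi for yi, row in enumerate(weights) for w in row) / total
    return x, y

# Toy latent heatmap: two equally strong neighbouring cells in row 1.
logits = [[0.0] * 4 for _ in range(4)]
logits[1][1] = 6.0
logits[1][2] = 6.0

x, y = soft_argmax_2d(logits)
print(f"x={x:.3f}, y={y:.3f}")
```

Unlike a hard argmax, the result is not restricted to integer cell positions: here x falls halfway between columns 1 and 2. The small background weights also pull y slightly away from row 1, which hints at why later work constrains the shape of the probability map.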

Although many regression-based works have been proposed, they ignore the spatial location prior, which leads to inferior performance and prevents them from being applied to multi-keypoint localization tasks. This work shows that embedding the spatial prior into regression significantly improves its performance and generalization capability on various human keypoint localization tasks. A more detailed comparison with heatmap-based and existing regression-based methods is presented in Sec. 3.3.

References

https://www.computer.org/csdl/proceedings-article/cvpr/2024/530000a624/20hMqbwjkty

https://bytez.com/docs/cvpr/31272/authors

https://ieeexplore.ieee.org/document/10656113

https://cs.pku.edu.cn/info/1264/3108.htm

https://www.cnblogs.com/fanzhongjie/p/12037219.html