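Overview

In geography and cartography, what is usually needed is an accurate vector polygon rather than a rasterized output. This paper addresses vector polygon extraction for buildings: it extracts building vertices directly from the image and connects them correctly to form precise polygons.

The PolyWorld model uses a graph neural network to predict the connection strength between every pair of vertices and estimates the assignment by solving a differentiable optimal transport problem. In addition, vertex positions are optimized by minimizing a combined segmentation and polygonal angle difference loss.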

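Introduction

The authors argue that most current building extraction and polygonization methods rely on vectorizing the probability map produced by a segmentation network. These methods are not learned end to end, so the defects and artifacts produced by the segmentation model propagate through the whole pipeline and result in irregular polygons.

The authors therefore propose a new neural network model, PolyWorld, which detects building corners in satellite images and connects them into polygons using a learned matching procedure.

PolyWorld uses a CNN to extract the positions and visual descriptors of building corners, and generates polygons by evaluating whether the connections between vertices are valid. This process finds the best connection assignment among the detected vertex descriptors, which means that every corner must be matched with the subsequent vertex of its polygon. The connections between polygon vertices can be expressed as the solution of a linear sum assignment problem. A GNN also embeds global information into all vertices to make the descriptors more distinctive. Finally, the model refines the detected corner positions by minimizing a combined segmentation and polygonal angle difference loss.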

PolyWorld Architecture

The main idea of PolyWorld is to represent the building polygons in an image as a set of vertices connected according to a permutation matrix. The model takes an image as input and outputs the positions of the detected building vertices together with a valid permutation matrix.

In PolyWorld, the connections between polygon vertices are represented by a permutation matrix. Row $i$ of the permutation matrix $P_{clock}$ or $P_{count}$ gives the index of the next vertex connected to $v_i$ in clockwise or counter-clockwise order, respectively. $P_{clock}$ is the transpose of $P_{count}$.

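Each polygon vertex is associated with a specific row of the permutation matrix, which indicates the next clockwise vertex. The permutation matrix must satisfy the following polygon constraints:

1. Each vertex has at most one clockwise connection and one counter-clockwise connection.
2. The permutation matrix for the clockwise orientation is the transpose of the permutation matrix for the counter-clockwise orientation.
3. Entries on the diagonal of the permutation matrix can be discarded, because a vertex linked to itself does not describe a realistic polygon.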
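To make the permutation-matrix representation concrete, the following is a minimal sketch of how closed polygons could be read back from such a matrix. The function names `trace_polygons` and `transpose` and the example matrix are hypothetical, not taken from the paper. The sketch assumes every row contains exactly one 1 and that a 1 on the diagonal marks a vertex to be discarded.

```python
def trace_polygons(p_next):
    """Follow 'next vertex' links to recover closed rings of vertex indices.

    p_next is a square 0/1 matrix (list of lists); p_next[i][j] == 1 means
    vertex j follows vertex i. Assumption: each row holds exactly one 1, and
    a 1 on the diagonal marks a vertex that is discarded.
    """
    n = len(p_next)
    next_of = [row.index(1) for row in p_next]
    visited = [False] * n
    polygons = []
    for start in range(n):
        if visited[start] or next_of[start] == start:
            continue  # already part of a ring, or a discarded vertex
        ring = []
        i = start
        while not visited[i]:
            visited[i] = True
            ring.append(i)
            i = next_of[i]
        polygons.append(ring)
    return polygons


def transpose(matrix):
    """Return the transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


# Hypothetical example: two triangles (0, 1, 2) and (4, 6, 5); vertex 3 is discarded.
next_index = [1, 2, 0, 3, 6, 4, 5]
P_clock = [[1 if j == nxt else 0 for j in range(7)] for nxt in next_index]
P_count = transpose(P_clock)  # second constraint: counter-clockwise = transpose

clockwise_rings = trace_polygons(P_clock)
counter_rings = trace_polygons(P_count)
```

Tracing the transposed matrix visits the same rings in the opposite direction, which is the consistency that the second constraint expresses.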

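PolyWorld consists of three modules: a Vertex Detection Network that extracts a set of candidate building corners, an attention Graph Neural Network that aggregates information across the vertices and refines their positions, and an Optimal Connection Network that generates the connections between vertices.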

Vertex Detection Network

This network takes an image $I \in \mathbb{R}^{3 \times H \times W}$ as input and passes it through a fully convolutional backbone to produce a $D$-dimensional feature map $F \in \mathbb{R}^{D \times H \times W}$. A $1 \times 1$ convolution on $F$ produces the vertex detection mask $Y \in \mathbb{R}^{H \times W}$, which is filtered by non-maximum suppression (NMS) with a kernel size of 3 so that only the relevant peaks are kept. The positions $p$ of the $N$ peaks are then used to extract $N$ visual descriptors $d \in \mathbb{R}^{D}$ from the feature map $F$. Each vertex position consists of image coordinates, $p_i = (x, y)_i$.
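The peak filtering and descriptor lookup described above can be sketched in plain Python as follows. This is an illustration under assumptions, not the authors' implementation: the function names, the `threshold` parameter and the toy arrays are hypothetical, and a real model would apply the same idea to tensors on the GPU.

```python
def nms_peaks(mask, threshold=0.5, kernel=3):
    """Return (x, y) positions that are the maximum of their kernel x kernel window.

    mask is an H x W list of lists. The threshold is an assumption added here
    so that flat background regions do not count as peaks.
    """
    h, w = len(mask), len(mask[0])
    r = kernel // 2
    peaks = []
    for y in range(h):
        for x in range(w):
            value = mask[y][x]
            if value < threshold:
                continue
            window_max = max(
                mask[yy][xx]
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1))
            )
            if value >= window_max:
                peaks.append((x, y))
    return peaks


def gather_descriptors(features, peaks):
    """features[c][y][x] is a D x H x W feature map; returns one D-vector per peak."""
    return [[channel[y][x] for channel in features] for (x, y) in peaks]


# Toy 4 x 5 detection mask with two clusters of high responses.
Y = [
    [0.1, 0.2, 0.1, 0.0, 0.0],
    [0.2, 0.9, 0.6, 0.0, 0.0],
    [0.1, 0.3, 0.2, 0.1, 0.7],
    [0.0, 0.0, 0.0, 0.2, 0.8],
]
# Toy feature map with D = 2 channels: the mask itself and a position code.
F = [Y, [[x + 10 * y for x in range(5)] for y in range(4)]]

p = nms_peaks(Y)
d = gather_descriptors(F, p)
```

In this toy mask the responses 0.6 and 0.7 are suppressed because a stronger neighbour lies inside their 3 x 3 window, so only one vertex survives per cluster.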

Attention Graph Neural Network

Beyond the position and visual appearance of a building corner, additional contextual information is essential. Capturing the relationships between a corner's position and appearance and those of the other vertices in the image helps to connect corners that share the same roof style, corners with compatible shape and pose, or simply adjacent corners.

For this reason, this network learns short- and long-range relationships among the vertex positions $p$ and visual descriptors $d$ to compute a set of matching descriptors $m_i \in \mathbb{R}^{D}$. The network also estimates a positional offset $t_i \in \mathbb{R}^{2}$ to refine each vertex position.

Vertex Encoder

Before being propagated through the graph neural network, the position $p$ and the visual descriptor $d$ are merged by a multilayer perceptron (MLP):
$$d'_i = \mathrm{MLP}_{enc}([d_i \,\|\, p_i])$$

$\mathrm{MLP}_{enc}$ takes the concatenation of $d_i$ and $p_i$ and returns a new descriptor $d'_i \in \mathbb{R}^{D}$ that encodes positional and visual information together.

Self Attention Network

The descriptors are aggregated with a self-attention mechanism that propagates information between vertices and enriches their contextual information.

Given the intermediate descriptors $x \in \mathbb{R}^{D \times N}$, the model uses linear projections to produce the queries $Q(x)$, keys $K(x)$ and values $V(x)$. The weights between nodes are computed with a softmax, and the result is multiplied by the values $V(x)$ so that information propagates across all vertices. The attention mechanism can be written as:
$$A = \mathrm{softmax}\left(\frac{Q(x) \cdot K(x)^{T}}{\sqrt{n_k}}\right) V(x)$$

where the normalization term $n_k$ is the dimension of $Q$ and $K$.
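A single attention step of this form can be sketched with the standard library alone. The sketch below is an assumption-laden illustration: the helper names are hypothetical, the descriptors are stored as $N$ rows of length $D$ (the transpose of the $D \times N$ layout used in the text), and identity matrices stand in for the learned linear projections.

```python
import math


def matmul(a, b):
    """Multiply two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over N vertex descriptors.

    x holds one descriptor per row (N x D); w_q, w_k, w_v are projection
    matrices. Returns the attended values and the N x N attention weights.
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    n_k = len(q[0])  # dimension of queries and keys
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(n_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Toy example: three vertices with 2-dimensional descriptors.
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned projections
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, weights = self_attention(x, identity, identity, identity)

# If all descriptors are identical, attention is uniform and returns them unchanged.
same = [[0.5, -1.0], [0.5, -1.0], [0.5, -1.0]]
attended_same, weights_same = self_attention(same, identity, identity, identity)
```

Each row of `weights` sums to 1, so every updated descriptor is a convex combination of the value vectors of all vertices.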

This operation is repeated $L$ times. $A^{(l)} \in \mathbb{R}^{D \times N}$ is the attention result at layer $l$, and its column $a^{(l)}_i$ for vertex $i$ is used to update the vertex descriptors at each layer:
$$x^{(l+1)}_i = \mathrm{MLP}^{(l)}([x^{(l)}_i \,\|\, a^{(l)}_i])$$

The embedding $x^{(L)}_i$ of the $i$-th vertex produced by the last attention layer is decomposed into two components, the matching descriptor $m_i \in \mathbb{R}^{D}$ and the positional offset $t_i \in \mathbb{R}^{2}$:
$$\begin{aligned} m_i &= \mathrm{MLP}_{match}(x^{(L)}_i) \\ t_i &= \mathrm{MLP}_{offset}(x^{(L)}_i) \end{aligned}$$

The matching descriptors are further used to generate valid combinations of connections between vertices, while the offset is combined with the vertex position as follows:
$$p_i = p_i + \gamma \cdot t_i$$

Optimal Connection Network

The purpose of this network is to generate the permutation matrix $P \in \mathbb{R}^{N \times N}$ that connects the vertices. It computes a score matrix $S \in \mathbb{R}^{N \times N}$ over all possible vertex pairs and obtains the assignment by maximizing the total score $\sum_{i,j} P_{i,j} S_{i,j}$.

Given two matching descriptors $m_i$ and $m_j$ that encode the information of two different vertices, $\mathrm{MLP}_{clock}$ is used to assess whether the clockwise connection $m_i \to m_j$ is plausible. The network receives the concatenation of the two descriptors and returns a high score if the connection between them is strong.
$$s^{clock}_{i \to j} = \mathrm{MLP}_{clock}([m_i \,\|\, m_j])$$

Conversely, $\mathrm{MLP}_{count}$ is used to assess the likelihood of the counter-clockwise connection $m_i \to m_j$.
$$s^{count}_{i \to j} = \mathrm{MLP}_{count}([m_i \,\|\, m_j])$$

Enforcing the second polygon constraint establishes a consistency check between the clockwise and counter-clockwise paths through the vertices. The final score matrix $S$ is the sum of the clockwise score matrix $S_{clock}$ and the transpose of the counter-clockwise score matrix $S_{count}$:
$$S = S_{clock} + S_{count}^{T}$$

Finally, the Sinkhorn algorithm [8,27,29,30] is used to find the optimal assignment matrix $P$ given the score matrix $S$. Sinkhorn is a GPU-efficient, differentiable version of the Hungarian algorithm for solving the linear sum assignment problem. It consists of normalizing the rows and columns of $\exp(S)$ for a certain number of iterations.
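The row and column normalization can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' code: the function names, the iteration count and the toy scores are assumptions, and the normalization is done directly on $\exp(S)$, which is only suitable for small scores (practical implementations typically work in log space for numerical stability).

```python
import math


def combine_scores(s_clock, s_count):
    """Add the clockwise scores to the transposed counter-clockwise scores."""
    n = len(s_clock)
    return [[s_clock[i][j] + s_count[j][i] for j in range(n)] for i in range(n)]


def sinkhorn(scores, iterations=50):
    """Alternately normalize the rows and columns of exp(scores)."""
    n = len(scores)
    p = [[math.exp(s) for s in row] for row in scores]
    for _ in range(iterations):
        # Row step: make every row sum to 1.
        row_sums = [sum(row) for row in p]
        p = [[p[i][j] / row_sums[i] for j in range(n)] for i in range(n)]
        # Column step: make every column sum to 1.
        col_sums = [sum(p[i][j] for i in range(n)) for j in range(n)]
        p = [[p[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return p


# Toy triangle 0 -> 1 -> 2 -> 0 (clockwise), i.e. 1 -> 0, 2 -> 1, 0 -> 2 counter-clockwise.
S_clock = [[0, 5, 0], [0, 0, 5], [5, 0, 0]]
S_count = [[0, 0, 5], [5, 0, 0], [0, 5, 0]]

S = combine_scores(S_clock, S_count)
P = sinkhorn(S)
next_vertex = [row.index(max(row)) for row in P]
```

Because both score matrices agree on the same triangle, the combined scores reinforce each other and the largest entry in each row of `P` points to the next clockwise vertex.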

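Losses

Detection

The authors train corner detection with a weighted binary cross-entropy loss:
$$L_{det} = -\omega \sum_{i=1}^{H} \sum_{j=1}^{W} \bar{Y}_{i,j} \cdot \log(Y_{i,j}) - \sum_{i=1}^{H} \sum_{j=1}^{W} (1 - \bar{Y}_{i,j}) \cdot \log(1 - Y_{i,j})$$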

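$\bar{Y}$ is the ground truth, a sparse array of zeros in which the pixels where a building corner is present have the value 1, and $Y$ is the predicted mask. Because foreground and background pixels are heavily imbalanced, the factor $\omega$ is used to up-weight the positive samples.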
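As a worked illustration of the weighted loss, the sketch below evaluates it on a tiny 2 x 2 mask. The function name `detection_loss`, the clamping constant `eps` and the example values are assumptions made for this example. The clamp only guards against `log(0)` and is not part of the formula.

```python
import math


def detection_loss(y_pred, y_true, omega, eps=1e-12):
    """Weighted binary cross-entropy summed over all pixels.

    y_pred holds predicted probabilities, y_true the 0/1 ground truth.
    omega multiplies only the positive (corner) term.
    """
    loss = 0.0
    for pred_row, true_row in zip(y_pred, y_true):
        for y, y_bar in zip(pred_row, true_row):
            y = min(max(y, eps), 1.0 - eps)  # assumption: avoid log(0)
            loss -= omega * y_bar * math.log(y)
            loss -= (1.0 - y_bar) * math.log(1.0 - y)
    return loss


# One corner pixel among four; the prediction is fairly confident about it.
Y_true = [[0, 1], [0, 0]]
Y_pred = [[0.1, 0.8], [0.2, 0.1]]

loss_weighted = detection_loss(Y_pred, Y_true, omega=10.0)
loss_unweighted = detection_loss(Y_pred, Y_true, omega=1.0)
```

With `omega=10.0` the single corner pixel contributes ten times its unweighted term, so a missed corner is penalized far more heavily than a false alarm on one of the many background pixels.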

Matching

The Self Attention Network and the Optimal Connection Network are differentiable, so backpropagation can be used. This path is trained in a supervised manner against the ground-truth permutation matrix $\bar{P}$ with a cross-entropy loss:
$$L_{match} = -\sum_{i=1}^{N} \sum_{j=1}^{N} \bar{P}_{i,j} \cdot \log(P_{i,j})$$

Because the Sinkhorn algorithm iteratively normalizes along both rows and columns, minimizing the negative log-likelihood of $P$ maximizes the precision and recall of the matching at the same time.
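The behaviour of this cross-entropy can be checked on a small example. The sketch below is illustrative only: `matching_loss`, the `eps` guard and the two candidate assignment matrices are hypothetical, and both candidates are doubly stochastic as a Sinkhorn output would approximately be.

```python
import math


def matching_loss(p_pred, p_true, eps=1e-12):
    """Cross-entropy between a predicted assignment and a 0/1 ground-truth permutation."""
    return -sum(
        t * math.log(max(p, eps))  # assumption: eps guards against log(0)
        for pred_row, true_row in zip(p_pred, p_true)
        for p, t in zip(pred_row, true_row)
    )


# Ground truth: 0 -> 1, 1 -> 2, 2 -> 0.
P_true = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]

# A confident, correct prediction and an uninformative one.
P_good = [[0.05, 0.90, 0.05], [0.05, 0.05, 0.90], [0.90, 0.05, 0.05]]
P_flat = [[1 / 3, 1 / 3, 1 / 3] for _ in range(3)]

loss_good = matching_loss(P_good, P_true)
loss_flat = matching_loss(P_flat, P_true)
```

Only the entries selected by the ground truth enter the sum, yet because rows and columns of the prediction are normalized, raising those entries necessarily lowers the competing ones in the same row and column.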

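Because of low image resolution, misaligned ground truth, or incorrect building labels, the vertex positions provided by the Vertex Detection Network are in practice not optimal. The subsequent matching procedure may therefore produce polygons whose corner angles differ from the ground truth, which degrades the visual quality of the extracted polygons. To suppress this effect, the authors minimize the difference between the corner angles of the predicted polygons and those of the ground-truth polygons.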

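Limitations

PolyWorld does not handle corners shared between buildings or buildings with holes well. The former could be addressed with multi-class segmentation that detects how many vertices lie at the same location. The latter might be addressed in post-processing by linking inner and outer rings that carry the same shape information.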

PolyWorld:Polygonal Building Extraction with Graph Neural
