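There is not much to say about the network structure: it consists of a backbone, a neck and a head. The backbone is CSPDarknet, the neck is PAFPN, and the head is a decoupled head. This post focuses on the concrete implementation of label assignment. YOLOX uses SimOTA, a simplified version of OTA.
In the implementation, label assignment and loss computation live in yolo_head.py together with the head's network layers, so the head is covered here as well.
First, the input xin of the forward function is the output of the neck. When the input shape is (4,3,416,416), the shapes in xin are [(4,128,52,52),(4,256,26,26),(4,512,13,13)], which are the output feature maps at strides 8, 16 and 32.
The for loop that follows runs the head layers on each of the three feature maps and computes the corresponding grids (what grids are is explained later). Take the (4,128,52,52) feature map at stride=8 as an example. self.stems[k] is a single 1x1 convolution, and the classification branch cls_conv and the regression branch reg_conv are each two 3x3 convolutions. self.cls_preds[k] produces the final classification output with shape (b,num_classes,52,52), self.reg_preds[k] the final regression output with shape (b,4,52,52), and self.obj_preds[k] the final objectness output with shape (b,1,52,52). Here b=4 and num_classes=16.
The three outputs are then joined with torch.cat into a tensor of shape (4,21,52,52), where 21=4+1+16. Next, self.get_output_and_grid() returns the grid coordinates grid and the decoded output output.
The code is as follows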
```python
def get_output_and_grid(self, output, k, stride, dtype):
    # (4,21,52,52)
    grid = self.grids[k]
    batch_size = output.shape[0]
    n_ch = 5 + self.num_classes
    hsize, wsize = output.shape[-2:]
    if grid.shape[2:4] != output.shape[2:4]:
        yv, xv = meshgrid([torch.arange(hsize), torch.arange(wsize)])
        grid = torch.stack((xv, yv), 2).view(1, 1, hsize, wsize, 2).type(dtype)  # (1,1,52,52,2), row-major, the (x, y) coordinate of every cell
        self.grids[k] = grid
    output = output.view(batch_size, 1, n_ch, hsize, wsize)  # (4,1,21,52,52)
    output = output.permute(0, 1, 3, 4, 2).reshape(
        batch_size, hsize * wsize, -1
    )  # (4,2704,21)
    grid = grid.view(1, -1, 2)  # (1,2704,2)
    output[..., :2] = (output[..., :2] + grid) * stride
    output[..., 2:4] = torch.exp(output[..., 2:4]) * stride
    return output, grid
```
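self.grids starts out as a list of three tensors of shape torch.Size([1]), so the `if grid.shape[2:4] != output.shape[2:4]` branch is taken on the first call. hsize and wsize are the height and width of the feature map, both 52 here. The yv and xv returned by meshgrid hold the y coordinate and the x coordinate of every cell of the feature map, as shown below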
```python
tensor([[ 0, 0, 0, ..., 0, 0, 0],
        [ 1, 1, 1, ..., 1, 1, 1],
        [ 2, 2, 2, ..., 2, 2, 2],
        ...,
        [49, 49, 49, ..., 49, 49, 49],
        [50, 50, 50, ..., 50, 50, 50],
        [51, 51, 51, ..., 51, 51, 51]])
tensor([[ 0, 1, 2, ..., 49, 50, 51],
        [ 0, 1, 2, ..., 49, 50, 51],
        [ 0, 1, 2, ..., 49, 50, 51],
        ...,
        [ 0, 1, 2, ..., 49, 50, 51],
        [ 0, 1, 2, ..., 49, 50, 51],
        [ 0, 1, 2, ..., 49, 50, 51]])
```
The x and y coordinates are then stacked into one (x, y) pair per cell, with shape (1,1,52,52,2), in row-major order, as follows
```python
tensor([[[[[ 0., 0.],
           [ 1., 0.],
           [ 2., 0.],
           ...,
           [49., 0.],
           [50., 0.],
           [51., 0.]],
          [[ 0., 1.],
           [ 1., 1.],
           [ 2., 1.],
           ...,
           [49., 1.],
           [50., 1.],
           [51., 1.]],
          ...,
          [[ 0., 50.],
           [ 1., 50.],
           [ 2., 50.],
           ...,
           [49., 50.],
           [50., 50.],
           [51., 50.]],
          [[ 0., 51.],
           [ 1., 51.],
           [ 2., 51.],
           ...,
           [49., 51.],
           [50., 51.],
           [51., 51.]]]]], device='cuda:0', dtype=torch.float16)
```
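Then output and grid are each reshaped with view. Each cell of output corresponds to one predicted box, and output[..., :2] is the offset of the box centre relative to that cell. The line `output[..., :2] = (output[..., :2] + grid) * stride` therefore adds the cell coordinates grid and multiplies by stride, which gives the centre of the predicted box on the input image. The next line, `output[..., 2:4] = torch.exp(output[..., 2:4]) * stride`, applies \(e^{t}\) and multiplies by stride, which gives the width and height of the predicted box on the input image.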
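The decoding step can be traced on a single cell with plain Python. This is a minimal sketch rather than the YOLOX code: it uses lists instead of tensors, and the helper names `make_grid` and `decode` are hypothetical.

```python
import math

def make_grid(hsize, wsize):
    """Row-major list of (x, y) cell coordinates, mirroring grid.view(1, -1, 2)."""
    return [(x, y) for y in range(hsize) for x in range(wsize)]

def decode(raw, grid_xy, stride):
    """Map one raw prediction (tx, ty, tw, th) to (cx, cy, w, h) in input pixels."""
    tx, ty, tw, th = raw
    gx, gy = grid_xy
    cx = (tx + gx) * stride        # offset + cell coordinate, scaled by stride
    cy = (ty + gy) * stride
    w = math.exp(tw) * stride      # exp keeps width and height positive
    h = math.exp(th) * stride
    return cx, cy, w, h

grid = make_grid(52, 52)
# Flat index 53 is row 1, column 1 of the 52x52 map.
print(grid[53])                                   # (1, 1)
print(decode((0.5, 0.5, 0.0, 0.0), grid[53], 8))  # (12.0, 12.0, 8.0, 8.0)
```

A raw width or height of 0 decodes to exactly one stride, so each level starts out predicting boxes about the size of its own cells.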
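With box coordinates on the input image plus the class and objectness scores in hand, the next step is label assignment and loss computation, all implemented in self.get_losses(). Its input outputs is the concatenation of the three decoded feature-map outputs, with shape (b, 3549, 21), where 3549=52x52+26x26+13x13 and 21=4+1+16.
Inside get_losses(), self.get_assignments is called to do the label assignment, and the method used is SimOTA. For the principles behind SimOTA and OTA, see the CSDN blog post "OTA: Optimal Transport Assignment for Object Detection 原理与代码解读" (principles and code walkthrough) and https://blog.csdn.net/ooooocj/article/details/136569249. The full implementation of get_assignments() is as follows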
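The shape (b, 3549, 21) can be checked with a few lines of arithmetic. The variable names below are illustrative and do not come from yolo_head.py.

```python
# Arithmetic check of the flattened output shape (illustrative names).
input_size = 416
strides = [8, 16, 32]
num_classes = 16

# Each level contributes one anchor point per feature-map cell.
feature_sizes = [input_size // s for s in strides]          # 52, 26, 13
num_anchors = sum(size * size for size in feature_sizes)    # 2704 + 676 + 169
num_channels = 4 + 1 + num_classes                          # box + obj + classes

print(feature_sizes, num_anchors, num_channels)  # [52, 26, 13] 3549 21
```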
```python
@torch.no_grad()
def get_assignments(
    self,
    batch_idx,
    num_gt,
    gt_bboxes_per_image,  # (17,4)
    gt_classes,  # (17)
    bboxes_preds_per_image,  # (3549,4)
    expanded_strides,  # (1,3549)
    x_shifts,  # (1,3549)
    y_shifts,  # (1,3549)
    cls_preds,  # (4,3549,16)
    obj_preds,  # (4,3549,1)
    mode="gpu",
):
    if mode == "cpu":
        print("-----------Using CPU for the Current Batch-------------")
        gt_bboxes_per_image = gt_bboxes_per_image.cpu().float()
        bboxes_preds_per_image = bboxes_preds_per_image.cpu().float()
        gt_classes = gt_classes.cpu().float()
        expanded_strides = expanded_strides.cpu().float()
        x_shifts = x_shifts.cpu()
        y_shifts = y_shifts.cpu()
    fg_mask, geometry_relation = self.get_geometry_constraint(
        gt_bboxes_per_image,
        expanded_strides,
        x_shifts,
        y_shifts,
    )  # (3549), (17,357)
    # Anchor points at True positions of fg_mask lie in the center area of at least one gt box
    # and are used for label assignment later. Anchor points at False positions lie in no gt box's center area.
    bboxes_preds_per_image = bboxes_preds_per_image[fg_mask]  # (357,4)
    cls_preds_ = cls_preds[batch_idx][fg_mask]  # (357,16)
    obj_preds_ = obj_preds[batch_idx][fg_mask]  # (357,1)
    num_in_boxes_anchor = bboxes_preds_per_image.shape[0]  # 357
    if mode == "cpu":
        gt_bboxes_per_image = gt_bboxes_per_image.cpu()
        bboxes_preds_per_image = bboxes_preds_per_image.cpu()
    pair_wise_ious = bboxes_iou(gt_bboxes_per_image, bboxes_preds_per_image, False)  # (17,357)
    gt_cls_per_image = (
        F.one_hot(gt_classes.to(torch.int64), self.num_classes)
        .float()
    )  # (17,16)
    pair_wise_ious_loss = -torch.log(pair_wise_ious + 1e-8)  # (17,357)
    if mode == "cpu":
        cls_preds_, obj_preds_ = cls_preds_.cpu(), obj_preds_.cpu()
    with torch.cuda.amp.autocast(enabled=False):
        cls_preds_ = (
            cls_preds_.float().sigmoid_() * obj_preds_.float().sigmoid_()
        ).sqrt()
        pair_wise_cls_loss = F.binary_cross_entropy(
            cls_preds_.unsqueeze(0).repeat(num_gt, 1, 1),  # (357,16)->(1,357,16)->(17,357,16)
            gt_cls_per_image.unsqueeze(1).repeat(1, num_in_boxes_anchor, 1),  # (17,16)->(17,1,16)->(17,357,16)
            reduction="none"
        ).sum(-1)  # (17,357), 16 classes in total, BCE computed per class
    del cls_preds_
    cost = (
        pair_wise_cls_loss
        + 3.0 * pair_wise_ious_loss
        + float(1e6) * (~geometry_relation)  # add a very large cost to anchor points outside the center area to filter them out
    )  # (17,357)
    (
        num_fg,  # 22
        gt_matched_classes,  # (22)
        pred_ious_this_matching,  # (22)
        matched_gt_inds,  # (22)
    ) = self.simota_matching(cost, pair_wise_ious, gt_classes, num_gt, fg_mask)
    del pair_wise_cls_loss, cost, pair_wise_ious, pair_wise_ious_loss
    if mode == "cpu":
        gt_matched_classes = gt_matched_classes.cuda()
        fg_mask = fg_mask.cuda()
        pred_ious_this_matching = pred_ious_this_matching.cuda()
        matched_gt_inds = matched_gt_inds.cuda()
    return (
        gt_matched_classes,
        fg_mask,  # (3549)
        pred_ious_this_matching,
        matched_gt_inds,
        num_fg,
    )
```
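OTA uses a center prior: only anchor points within a limited region around the centre of a gt box are candidates for positive samples, rather than every anchor point inside the gt box. The function get_geometry_constraint implements this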
```python
def get_geometry_constraint(
    self, gt_bboxes_per_image, expanded_strides, x_shifts, y_shifts,
):
    """
    Calculate whether the center of an object is located in a fixed range of
    an anchor. This is used to avert inappropriate matching. It can also reduce
    the number of candidate anchors so that the GPU memory is saved.
    """
    expanded_strides_per_image = expanded_strides[0]  # (3549)
    x_centers_per_image = ((x_shifts[0] + 0.5) * expanded_strides_per_image).unsqueeze(0)  # (1,3549)
    y_centers_per_image = ((y_shifts[0] + 0.5) * expanded_strides_per_image).unsqueeze(0)  # (1,3549)
    # in fixed center
    center_radius = 1.5  # the center area can end up larger than the object itself
    center_dist = expanded_strides_per_image.unsqueeze(0) * center_radius  # (1,3549)
    gt_bboxes_per_image_l = (gt_bboxes_per_image[:, 0:1]) - center_dist  # (17,1) -> (17,3549)
    gt_bboxes_per_image_r = (gt_bboxes_per_image[:, 0:1]) + center_dist  # (17,3549)
    gt_bboxes_per_image_t = (gt_bboxes_per_image[:, 1:2]) - center_dist  # (17,3549)
    gt_bboxes_per_image_b = (gt_bboxes_per_image[:, 1:2]) + center_dist  # (17,3549)
    c_l = x_centers_per_image - gt_bboxes_per_image_l  # (1,3549)-(17,3549) -> (17,3549)
    c_r = gt_bboxes_per_image_r - x_centers_per_image
    c_t = y_centers_per_image - gt_bboxes_per_image_t
    c_b = gt_bboxes_per_image_b - y_centers_per_image
    center_deltas = torch.stack([c_l, c_t, c_r, c_b], 2)  # (17,3549,4)
    is_in_centers = center_deltas.min(dim=-1).values > 0.0  # (17,3549)
    anchor_filter = is_in_centers.sum(dim=0) > 0  # (3549); of the 3549 anchor points, False means the point is in no gt box's center area
    geometry_relation = is_in_centers[:, anchor_filter]  # (17,357); anchor_filter.sum()==357, the points that are in at least one gt box's center area
    return anchor_filter, geometry_relation
```
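The returned anchor_filter is a boolean tensor of shape (3549,), where each element is True or False. As noted earlier, the three feature maps have 3549 anchor points in total. An anchor point whose value is False lies in no gt box's center area, and label assignment later picks only from the anchor points whose value is True. When I debugged with my own data, the other output geometry_relation had shape (17, 357): 17 is the number of gt boxes in the image, and 357 is the number of True entries in anchor_filter. geometry_relation records, for each gt, which of those anchor points fall inside its center area.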
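The center-area test for a single gt and a single anchor point can be written out in plain Python. This is a sketch under the assumption that the gt is given as a centre (cx, cy), which is how the code above uses columns 0 and 1. The function name `in_center_area` is hypothetical.

```python
def in_center_area(gt_cx, gt_cy, x_shift, y_shift, stride, center_radius=1.5):
    """True if the anchor point centre lies strictly inside the square of
    half-width center_radius * stride around the gt centre."""
    ax = (x_shift + 0.5) * stride   # anchor point centre on the input image
    ay = (y_shift + 0.5) * stride
    dist = center_radius * stride
    deltas = (
        ax - (gt_cx - dist),   # c_l
        ay - (gt_cy - dist),   # c_t
        (gt_cx + dist) - ax,   # c_r
        (gt_cy + dist) - ay,   # c_b
    )
    return min(deltas) > 0.0

# gt centre (100, 100) at stride 8: the square spans 88..112 on both axes.
print(in_center_area(100, 100, 12, 12, 8))  # anchor centre (100, 100) -> True
print(in_center_area(100, 100, 14, 12, 8))  # anchor centre (116, 100) -> False

# Count the stride-8 candidates for this gt on a 52x52 map.
count = sum(in_center_area(100, 100, x, y, 8) for y in range(52) for x in range(52))
print(count)  # 9
```

Because the radius scales with the stride, the same gt gets a wider square on coarser levels, which is why the center area can exceed a small object.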
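Next, fg_mask (that is, anchor_filter) selects the candidate positive samples, and the OTA cost matrix is computed. The cost consists of a classification loss and a regression loss, the latter being -log(IoU) weighted by 3.0. Note that the classification prediction and the obj prediction are each passed through sigmoid, multiplied together and square-rooted before the binary cross-entropy against the gt is computed. Finally, the term float(1e6) * (~geometry_relation) adds a very large cost to anchors outside each gt's center area, which filters them out.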
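For one (gt, anchor) pair the cost can be sketched in plain Python as follows. This is an illustration, not the YOLOX code: `pair_cost` is a hypothetical helper, and unlike F.binary_cross_entropy it does not clamp the logarithm, so it assumes moderate logits.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce(p, t):
    """Binary cross-entropy for one probability p and one 0/1 target t."""
    return -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))

def pair_cost(cls_logits, obj_logit, gt_class, iou, in_center):
    # Joint score per class: sqrt(sigmoid(cls) * sigmoid(obj)).
    scores = [math.sqrt(sigmoid(c) * sigmoid(obj_logit)) for c in cls_logits]
    # BCE against the one-hot gt class, summed over classes.
    cls_cost = sum(
        bce(s, 1.0 if i == gt_class else 0.0) for i, s in enumerate(scores)
    )
    iou_cost = -math.log(iou + 1e-8)
    penalty = 0.0 if in_center else 1e6   # anchors outside the center area
    return cls_cost + 3.0 * iou_cost + penalty

near = pair_cost([2.0, -2.0, -2.0], 1.0, gt_class=0, iou=0.6, in_center=True)
far = pair_cost([2.0, -2.0, -2.0], 1.0, gt_class=0, iou=0.6, in_center=False)
print(near < far)
```

The 1e6 penalty dwarfs the other two terms, so an anchor outside a gt's center area is effectively never among that gt's lowest-cost candidates.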
With the cost matrix in hand, SimOTA performs the label assignment. The implementation is in the function simota_matching
```python
def simota_matching(self, cost, pair_wise_ious, gt_classes, num_gt, fg_mask):
    # (17,357),(17,357),(17),17,(3549)
    matching_matrix = torch.zeros_like(cost, dtype=torch.uint8)  # (17,357)
    n_candidate_k = min(10, pair_wise_ious.size(1))  # the 10 here is the q used for dynamic k in the paper
    topk_ious, _ = torch.topk(pair_wise_ious, n_candidate_k, dim=1)  # (17,10)
    dynamic_ks = torch.clamp(topk_ious.sum(1).int(), min=1)  # (17)
    # tensor([3, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], device='cuda:0', dtype=torch.int32)
    # for each gt, sum its q largest IoUs and truncate to int: the number of anchor points assigned to that gt
    for gt_idx in range(num_gt):
        _, pos_idx = torch.topk(
            cost[gt_idx], k=dynamic_ks[gt_idx], largest=False
        )  # pick the dynamic k anchor points with the lowest cost as positives, replacing the Sinkhorn algorithm of the original OTA
        matching_matrix[gt_idx][pos_idx] = 1
    del topk_ious, dynamic_ks, pos_idx
    anchor_matching_gt = matching_matrix.sum(0)  # (357)
    # deal with the case that one anchor matches multiple ground-truths
    if anchor_matching_gt.max() > 1:
        multiple_match_mask = anchor_matching_gt > 1
        _, cost_argmin = torch.min(cost[:, multiple_match_mask], dim=0)  # when one anchor point matches several gts, keep the gt with the lowest cost
        matching_matrix[:, multiple_match_mask] *= 0
        matching_matrix[cost_argmin, multiple_match_mask] = 1
    fg_mask_inboxes = anchor_matching_gt > 0  # (357), mask of the positive anchor points
    num_fg = fg_mask_inboxes.sum().item()
    # num_fg==22, anchor_matching_gt.sum()==tensor(22, device='cuda:0')
    # when `anchor_matching_gt.max() > 1` holds, anchor_matching_gt.sum() > num_fg because it was computed
    # before deduplication, while matching_matrix.sum().item() == num_fg afterwards
    # fg_mask.sum().item() == 357
    fg_mask[fg_mask.clone()] = fg_mask_inboxes
    # fg_mask.sum().item() == 22
    # Update fg_mask: it first held the 357 anchor points that passed the gt center-area filter,
    # and the SimOTA matching then narrows these down to the positive anchor points.
    # Note the role of fg_mask.clone() inside []: it locates those 357 entries, which are then overwritten with fg_mask_inboxes.
    # fg_mask is updated in place, so the caller's fg_mask changes too without being returned.
    # fg_mask now marks which of the 3549 anchors are positives (True at positive positions).
    matched_gt_inds = matching_matrix[:, fg_mask_inboxes].argmax(0)
    # matched_gt_inds == tensor([7, 7, 8, 2, 1, 9, 4, 4, 10, 10, 3, 13, 5, 14, 12, 6, 15, 16, 11, 0, 0, 0], device='cuda:0')
    # index of the gt that each positive anchor was matched to
    # print(gt_classes) == tensor([6, 5, 12, 12, 12, 5, 12, 12, 5, 12, 12, 5, 5, 5, 5, 12, 12], device='cuda:0', dtype=torch.float16)
    gt_matched_classes = gt_classes[matched_gt_inds]  # class index of the gt matched to each positive anchor
    # print(gt_matched_classes) == tensor([12, 12, 5, 12, 5, 12, 12, 12, 12, 12, 12, 5, 5, 5, 5, 12, 12, 12, 5, 6, 6, 6.], device='cuda:0', dtype=torch.float16)
    pred_ious_this_matching = (matching_matrix * pair_wise_ious).sum(0)[
        fg_mask_inboxes
    ]
    # sum(0) sums down each column. A column has at most one non-zero value because, after the step above,
    # an anchor matches only one gt. A row can have several non-zero values: one gt can match several anchors.
    return num_fg, gt_matched_classes, pred_ious_this_matching, matched_gt_inds
```
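The in-place update `fg_mask[fg_mask.clone()] = fg_mask_inboxes` is the least obvious line in this function. A plain-Python analogue on lists, with a hypothetical helper `refine_mask`, shows what it does: the True positions of the long mask are overwritten, in order, by the entries of the short mask.

```python
def refine_mask(fg_mask, fg_mask_inboxes):
    """List analogue of fg_mask[fg_mask.clone()] = fg_mask_inboxes.

    fg_mask has one entry per anchor; fg_mask_inboxes has one entry per
    True position of fg_mask. Returns a new list instead of updating in place.
    """
    assert sum(fg_mask) == len(fg_mask_inboxes)
    refined = iter(fg_mask_inboxes)
    return [next(refined) if keep else False for keep in fg_mask]

# 5 anchors, 3 candidates after the center-area filter, 2 positives after matching.
print(refine_mask([True, False, True, True, False], [False, True, True]))
# [False, False, True, True, False]
```

The PyTorch version differs in that it mutates the tensor, which is why the caller sees the refined mask without it being returned.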
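The difference between SimOTA and the original OTA is this: after obtaining the cost matrix, OTA solves the matching with the Sinkhorn algorithm, whereas SimOTA simply takes the k anchors with the lowest cost as positives. This is similar in spirit to the IoU-based selection in the early Faster R-CNN, except that there the choice is by largest IoU and here it is by lowest cost, where the cost accounts for the classification loss and the center prior as well as IoU. In addition, k is not a manually fixed value but a dynamic k: for each gt, take the q anchors with the largest IoU (q is still set by hand, 10 in the code), then sum those 10 IoUs and truncate to an integer to get k, with a minimum of 1.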
```python
n_candidate_k = min(10, pair_wise_ious.size(1))  # the 10 here is the q used for dynamic k in the paper
topk_ious, _ = torch.topk(pair_wise_ious, n_candidate_k, dim=1)  # (17,10)
dynamic_ks = torch.clamp(topk_ious.sum(1).int(), min=1)  # (17)
```
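One gt can match several anchors, but one anchor can match only one gt. If, after picking the k lowest-cost anchors by the rule above, some anchor is matched to several gts, the gt with the lowest cost for that anchor is kept as the match.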
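The whole matching rule fits in a short plain-Python sketch on nested lists. `simota_match` is a hypothetical name and the numbers are made up for illustration. As in the PyTorch code, a conflicting anchor is given to the gt with the lowest cost in that column.

```python
def simota_match(cost, ious, q=10):
    """cost, ious: num_gt x num_anchor nested lists. Returns {anchor: gt}."""
    num_gt, num_anchor = len(cost), len(cost[0])
    matching = [[0] * num_anchor for _ in range(num_gt)]
    for g in range(num_gt):
        # dynamic k: sum of the top-q IoUs, truncated to int, at least 1
        topq = sorted(ious[g], reverse=True)[: min(q, num_anchor)]
        k = max(int(sum(topq)), 1)
        # the k anchors with the lowest cost become positives for this gt
        for a in sorted(range(num_anchor), key=lambda a: cost[g][a])[:k]:
            matching[g][a] = 1
    assigned = {}
    for a in range(num_anchor):
        claims = [g for g in range(num_gt) if matching[g][a]]
        if len(claims) == 1:
            assigned[a] = claims[0]
        elif len(claims) > 1:
            # conflict: argmin of the cost column, as torch.min(cost[:, mask], dim=0)
            assigned[a] = min(range(num_gt), key=lambda g: cost[g][a])
    return assigned

ious = [[0.9, 0.8, 0.4, 0.1], [0.1, 0.7, 0.6, 0.2]]
cost = [[0.2, 0.5, 1.5, 3.0], [3.0, 0.4, 0.6, 2.5]]
# gt 0 gets k=2 (anchors 0 and 1), gt 1 gets k=1 (anchor 1); anchor 1 goes to gt 1.
print(simota_match(cost, ious))  # {0: 0, 1: 1}
```

After the conflict is resolved gt 0 keeps only one anchor, so a gt can end up with fewer positives than its dynamic k.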
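Once the samples are assigned, the loss is computed. There is little to say here: the regression loss is an IoU loss, and the classification and obj losses are both BCE losses. In YOLOX the authors turn off mosaic augmentation for the last 15 epochs and add an extra L1 loss to improve regression accuracy. This L1 loss is computed on the feature map: the prediction is the raw feature-map output, without adding grid and multiplying by stride to map it back to the input image as is done for the IoU loss, and the target is the label mapped back onto the feature map.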
```python
def get_l1_target(self, l1_target, gt, stride, x_shifts, y_shifts, eps=1e-8):
    l1_target[:, 0] = gt[:, 0] / stride - x_shifts
    l1_target[:, 1] = gt[:, 1] / stride - y_shifts
    l1_target[:, 2] = torch.log(gt[:, 2] / stride + eps)
    l1_target[:, 3] = torch.log(gt[:, 3] / stride + eps)
    return l1_target
```
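The L1 target is the inverse of the decoding done in get_output_and_grid. A scalar sketch in plain Python makes the round trip explicit. The names `l1_target` and `decode_box` are hypothetical, and one box is handled at a time instead of a tensor of boxes.

```python
import math

def l1_target(gt, stride, x_shift, y_shift, eps=1e-8):
    """Map a gt box (cx, cy, w, h) on the input image to feature-map targets."""
    cx, cy, w, h = gt
    return (
        cx / stride - x_shift,
        cy / stride - y_shift,
        math.log(w / stride + eps),
        math.log(h / stride + eps),
    )

def decode_box(t, stride, x_shift, y_shift):
    """Inverse mapping, the same formula as the decoding of the raw head output."""
    tx, ty, tw, th = t
    return (
        (tx + x_shift) * stride,
        (ty + y_shift) * stride,
        math.exp(tw) * stride,
        math.exp(th) * stride,
    )

gt = (100.0, 60.0, 40.0, 24.0)
target = l1_target(gt, stride=8, x_shift=12, y_shift=7)
restored = decode_box(target, stride=8, x_shift=12, y_shift=7)
print(target[:2])  # (0.5, 0.5)
print(all(abs(a - b) < 1e-6 for a, b in zip(restored, gt)))  # True
```

The small eps only guards against log(0) for degenerate boxes and changes the round trip by a negligible amount.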