Introduction to Deep Learning (4): Object Detection

Object Detection

Output:

  1. category label from fixed, known set of categories
  2. bounding box (x, y, width, height)

If only a single object needs to be detected -> add an FC layer to a network pretrained on ImageNet

Sliding Window

Apply a CNN to many different crops of the image; the CNN classifies each crop as object / background

But there are far too many windows, and the same object may be detected repeatedly
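To get a feel for how many windows there are, here is a rough count. It assumes a window is any axis-aligned box with integer pixel size and position; `count_windows` is a hypothetical helper name used only for this illustration.

```python
def count_windows(height, width):
    """Count every axis-aligned box that fits inside a height x width image.

    A box of size h x w can be placed at (height - h + 1) * (width - w + 1)
    positions. Summing over all h and w gives the closed form used below.
    """
    vertical_choices = height * (height + 1) // 2    # all (top, bottom) pairs
    horizontal_choices = width * (width + 1) // 2    # all (left, right) pairs
    return vertical_choices * horizontal_choices


if __name__ == "__main__":
    # Tens of billions of candidate boxes for an 800 x 600 image.
    print(count_windows(600, 800))
```

Running a CNN on every one of these crops is clearly not feasible, which motivates the next step.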

We need region proposals to find a small set of boxes that are likely to cover all the objects

"Selective Search" is quick to generate about 2000 region proposals

R-CNN : Region-Based CNN

  1. Region proposals
  2. warp each region to a fixed size of 224*224
  3. forward each region through the ConvNet independently
  4. output a classification score and a Bbox of 4 numbers for each region; the boxes are then compared and filtered with the measures below

Measurement of boxes (IoU)

$$IoU = \frac{\text{Area of Intersection}}{\text{Area of Union}}$$

$IoU > 0.5$ is decent

$IoU > 0.7$ is pretty good

$IoU > 0.9$ is perfect
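The IoU definition can be sketched in a few lines. This is a minimal illustration that assumes boxes are given as `(x, y, width, height)` with `(x, y)` the top-left corner, matching the output format listed at the top of these notes; the function name `iou` is my own choice.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b

    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h

    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    # Two 2x2 boxes overlapping in a 1x1 square: 1 / (4 + 4 - 1) = 1/7.
    print(iou((0, 0, 2, 2), (1, 1, 2, 2)))
```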

Overlapping Boxes: Non-Max Suppression (NMS)
  1. select the next highest-scoring box
  2. eliminate lower-scoring boxes whose IoU with the box selected in step 1 exceeds a threshold (e.g. 0.7)
  3. if any boxes remain, go to step 1
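The three NMS steps above can be sketched as a greedy loop. This is an illustrative pure-Python version, assuming `(x, y, width, height)` boxes and a single category; the names `nms` and `iou_threshold` are assumptions for this example, and real detectors normally use a vectorised library routine instead.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.7):
    """Return the indices of the boxes kept by greedy non-max suppression."""
    # Candidate indices, highest score first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # step 1: next highest-scoring box
        keep.append(best)
        # step 2: drop remaining boxes that overlap the selected one too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
        # step 3: loop again while any boxes remain
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 0, 10, 10), (20, 20, 5, 5)]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores))  # the second box overlaps the first heavily
```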

Evaluating Object Detectors: mAP (Mean Average Precision)

  1. run detector on all test images + NMS

  2. for each category, compute AP = area under the Precision vs Recall curve

     1. for each detection (highest score -> lowest)
        1. if it matches some GT (Ground-Truth) box with IoU > 0.5, mark it as positive and eliminate that GT box
        2. otherwise mark it as negative
        3. plot a point on the PR curve
     2. AP = area under the PR curve
  3. mAP = average of AP for each category

  4. COCO mAP: compute mAP for each IoU threshold and take average

How to get AP = 1.0 -> hit all GT boxes with IoU > 0.5, no false positive ranked above any true positive
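The per-category AP procedure can be sketched as follows. This is a simplified illustration, not a reference implementation: it assumes a single image and a single category, `(x, y, width, height)` boxes, and it measures the area under the PR curve with a plain step-wise sum (benchmarks such as PASCAL VOC and COCO use interpolated variants). The name `average_precision` is hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


def average_precision(detections, gt_boxes, iou_threshold=0.5):
    """detections: list of (score, box); gt_boxes: list of boxes."""
    num_gt = len(gt_boxes)
    if num_gt == 0:
        return 0.0
    unmatched = list(gt_boxes)
    true_pos = false_pos = 0
    ap, prev_recall = 0.0, 0.0
    # Walk through detections from highest to lowest score.
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        overlaps = [iou(box, gt) for gt in unmatched]
        best = max(range(len(overlaps)), key=lambda i: overlaps[i], default=-1)
        if best >= 0 and overlaps[best] > iou_threshold:
            true_pos += 1
            unmatched.pop(best)      # each GT box can be matched only once
        else:
            false_pos += 1
        # One new point on the PR curve per detection.
        precision = true_pos / (true_pos + false_pos)
        recall = true_pos / num_gt
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap


if __name__ == "__main__":
    gts = [(0, 0, 10, 10), (20, 20, 10, 10)]
    dets = [(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 5, 5)), (0.7, (20, 20, 10, 10))]
    print(average_precision(dets, gts))
```

In this sketch a false positive only hurts AP if it is ranked above a true positive, which is the condition for AP = 1.0 stated above. mAP would then be the mean of this value over all categories.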

Fast R-CNN

  1. ConvNet (backbone network) -> convolutional features for the entire high-resolution image
  2. Regions of Interest (RoIs)
  3. Crop + Resize features
  4. Per-Region Network (light-weight -> fast)
  5. output category and box

Cropping Features: RoI Pool

  1. project the proposal onto the features
  2. snap to grid cells
  3. divide into a 2*2 grid of (roughly) equal subregions
  4. max-pool within each subregion
  5. output the region features (always the same size even if we have different sizes of input regions)
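The RoI Pool steps above can be sketched on a single-channel feature map. This is a minimal illustration under several assumptions that are not in the original notes: the feature map is a plain 2D list, the RoI is `(x, y, width, height)` in image pixels, `stride` is the backbone's total downsampling factor, and bins are split with floor/ceil boundaries so neighbouring bins may share a row or column when the region does not divide evenly.

```python
def roi_pool(feature_map, roi, stride, output_size=2):
    """Max-pool an image-space RoI into an output_size x output_size grid."""
    feat_h, feat_w = len(feature_map), len(feature_map[0])
    x, y, w, h = roi

    # Steps 1-2: project onto the feature map and snap to whole grid cells.
    # Note: Python's round() rounds exact halves to the nearest even integer.
    x0 = max(0, min(feat_w - 1, round(x / stride)))
    y0 = max(0, min(feat_h - 1, round(y / stride)))
    x1 = max(x0 + 1, min(feat_w, round((x + w) / stride)))
    y1 = max(y0 + 1, min(feat_h, round((y + h) / stride)))

    def bin_range(start, length, index):
        # Step 3: roughly equal bins, floor for the start and ceil for the end.
        lo = start + (index * length) // output_size
        hi = start + ((index + 1) * length + output_size - 1) // output_size
        return range(lo, hi)

    pooled = []
    for i in range(output_size):
        row = []
        for j in range(output_size):
            # Step 4: max-pool inside the bin.
            row.append(max(feature_map[r][c]
                           for r in bin_range(y0, y1 - y0, i)
                           for c in bin_range(x0, x1 - x0, j)))
        pooled.append(row)
    return pooled  # Step 5: always output_size x output_size


if __name__ == "__main__":
    features = [[r * 4 + c for c in range(4)] for r in range(4)]
    print(roi_pool(features, (0, 0, 64, 64), stride=16))
    print(roi_pool(features, (16, 16, 48, 32), stride=16))
```

Both calls return a 2*2 grid even though the two regions have different sizes, which is what lets the per-region network take a fixed-size input.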

RoI Align

RoI Align -> no snapping; features are sampled with bilinear interpolation, so they stay better aligned with the proposal

Faster R-CNN

Insert Region Proposal Network (RPN) to predict proposals from features

backbone network -> RPN -> region proposals

Imagine an anchor box of fixed size at each point in the feature map

At each point predict whether the corresponding anchor contains an object

for positive boxes, also predict a box transform to regress from anchor box to object box

Use k different anchor boxes at each point
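A small sketch of the anchor idea follows. It assumes anchors are stored as `(cx, cy, width, height)` with the centre at the middle of each feature-map cell, and it uses the box parameterisation from the Faster R-CNN paper (shift the centre in units of the anchor size, scale width and height through an exponential). The function names and the example anchor shapes are hypothetical.

```python
import math


def make_anchors(feat_h, feat_w, stride, shapes):
    """Place k = len(shapes) anchors at every point of the feature map."""
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            # Centre of this feature-map cell in image coordinates.
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
            for w, h in shapes:
                anchors.append((cx, cy, float(w), float(h)))
    return anchors


def apply_box_transform(anchor, transform):
    """Regress from an anchor box to an object box.

    transform = (tx, ty, tw, th): tx, ty shift the centre relative to the
    anchor size; tw, th rescale width and height in log space.
    """
    cx, cy, w, h = anchor
    tx, ty, tw, th = transform
    return (cx + tx * w, cy + ty * h, w * math.exp(tw), h * math.exp(th))


if __name__ == "__main__":
    shapes = [(32, 32), (64, 32), (32, 64)]          # k = 3 anchors per point
    anchors = make_anchors(2, 3, stride=16, shapes=shapes)
    print(len(anchors))                               # 2 * 3 * 3 anchors
    print(apply_box_transform(anchors[0], (0.5, -0.25, 0.0, 0.0)))
```

In the real network the objectness score and the four transform values per anchor are predicted by convolution layers on top of the backbone features; here they are just passed in by hand.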

Single-Stage Object Detection

Skip the per-region second stage: use the anchors directly to predict the object category and the box at each point

Semantic Segmentation: Fully Convolutional Network

Input -> Convolutions -> Scores C * H * W -> argmax H * W

Train the network with a cross-entropy loss computed at every pixel
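The per-pixel loss and the argmax prediction can be sketched without any framework. This is an illustrative version that assumes `scores` is a nested list indexed as `scores[c][y][x]` (shape C * H * W) and `labels[y][x]` holds the ground-truth class index; the loss is averaged over pixels, which is one common choice.

```python
import math


def pixelwise_cross_entropy(scores, labels):
    """Mean softmax cross-entropy over all pixels of a C x H x W score map."""
    num_classes = len(scores)
    height, width = len(labels), len(labels[0])
    total = 0.0
    for y in range(height):
        for x in range(width):
            logits = [scores[c][y][x] for c in range(num_classes)]
            # Numerically stable log-sum-exp of the class scores.
            top = max(logits)
            log_sum = top + math.log(sum(math.exp(v - top) for v in logits))
            total += log_sum - logits[labels[y][x]]
    return total / (height * width)


def predict(scores):
    """Argmax over the class dimension: C x H x W scores -> H x W labels."""
    num_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [[max(range(num_classes), key=lambda c: scores[c][y][x])
             for x in range(width)]
            for y in range(height)]


if __name__ == "__main__":
    scores = [[[0.0, 5.0]], [[3.0, 1.0]]]   # C = 2, H = 1, W = 2
    labels = [[1, 0]]
    print(predict(scores))
    print(pixelwise_cross_entropy(scores, labels))
```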

Trick: Downsampling and Upsampling

Downsampling : Pooling, strided convolution

Upsampling

Unpooling

Bed of nails: put each value in one corner of its output block and fill the rest with 0

Nearest Neighbour: copy each value to every position of its output block
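Both unpooling variants are easy to show on a tiny 2*2 input. The sketch below assumes a 2D list as input, an integer upsampling `factor`, and that bed of nails places the value in the top-left corner of each block; the function names are made up for this example.

```python
def bed_of_nails_unpool(x, factor=2):
    """Put each input value at the top-left of its block, zeros elsewhere."""
    out = [[0] * (len(x[0]) * factor) for _ in range(len(x) * factor)]
    for i, row in enumerate(x):
        for j, value in enumerate(row):
            out[i * factor][j * factor] = value
    return out


def nearest_neighbor_unpool(x, factor=2):
    """Copy each input value to every position of its block."""
    return [[x[i // factor][j // factor]
             for j in range(len(x[0]) * factor)]
            for i in range(len(x) * factor)]


if __name__ == "__main__":
    small = [[1, 2], [3, 4]]
    for row in bed_of_nails_unpool(small):
        print(row)
    for row in nearest_neighbor_unpool(small):
        print(row)
```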

Bilinear Interpolation

$$f_{x,y} = \sum_{i,j} f_{i,j} \max(0, 1-|x-i|) \max(0, 1-|y-j|)$$

where $(i, j)$ ranges over the nearest neighbours of $(x, y)$

Use two closest neighbours in x and y to construct linear approximations
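The bilinear formula can be evaluated directly at a fractional location. This sketch assumes `grid[i][j]` stores $f_{i,j}$ with `i` paired with `x` and `j` paired with `y`, as in the formula, and that neighbours falling outside the grid simply contribute nothing; `bilinear_sample` is a hypothetical name.

```python
import math


def bilinear_sample(grid, x, y):
    """Evaluate f(x, y) from the two closest neighbours in x and in y."""
    n_i, n_j = len(grid), len(grid[0])
    i0, j0 = math.floor(x), math.floor(y)
    value = 0.0
    for i in (i0, i0 + 1):
        for j in (j0, j0 + 1):
            if 0 <= i < n_i and 0 <= j < n_j:
                # Weight shrinks linearly with distance along each axis.
                weight = max(0.0, 1 - abs(x - i)) * max(0.0, 1 - abs(y - j))
                value += grid[i][j] * weight
    return value


if __name__ == "__main__":
    grid = [[0.0, 10.0], [20.0, 30.0]]
    print(bilinear_sample(grid, 0.5, 0.5))   # centre: average of all four
    print(bilinear_sample(grid, 0.0, 0.25))  # a quarter of the way from 0 to 10
```

At integer locations the weights collapse to a single neighbour, so the original values are reproduced exactly.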

Bicubic Interpolation

Use a larger set of closest neighbours in x and y (a 4*4 neighbourhood in the standard formulation) to construct cubic approximations

Max Unpooling
Learnable Upsampling

Mask R-CNN

Just add Conv layers to predict a mask for each of C classes on the region proposals

Panoptic Segmentation

Label every pixel, and also separate different objects in the same category

Human Keypoints

Represent the pose of a human by locating a set of keypoints

Joint Instance Segmentation and Pose Estimation

-> General Idea: Add Per-Region "Heads" to Faster / Mask R-CNN

Dense captioning -> nlp -> visual reasoning

3D shape prediction ...
