深度学习入门(4) -Object Detection 目标检测

Object Detection

Output:

  1. category label from fixed, known set of categories
  2. bounding box (x, y, width, height)

If only one object is needed to be detected -> add FC layer to the Net pretrianed on ImageNet

Sliding Window

apply a CNN to many different crops of the image, CNN classifies each crop as object / backgroud

but too many windows!! and may detect repeatedly

we need region proposals to find a small set of boxes that are likely to cover all the objects

"Selective Search" quick to generate 2000 regions

R-CNN : Region-Based CNN

  1. Region proposals
  2. warped the image to fixed size 224*224
  3. forward each region through ConvNet independently
  4. output a classification score and also a Bbox of 4 numbers, using the following algorithm
Measurement of boxes (IoU)

I o U = Area of Intersection Area of Union IoU = \frac{\text{Area of Intersection}}{\text{Area of Union}} IoU=Area of UnionArea of Intersection

I o U > 0.5 IoU > 0.5 IoU>0.5 is decent

I o U > 0.7 IoU > 0.7 IoU>0.7 pretty good

I o U > 0.9 IoU > 0.9 IoU>0.9 perfect

Overlapping Boxes: Non-Max Suppression (NMS)
  1. select next highest-scoring box
  2. eliminate lower-scoring boxes with IoU>0.7 (with the box we selected in step1)
  3. If any boxes remain goto 1

Evaluating Object Detectors: mAP(Mean Average Precision)

  1. run detector on all test images + NMS

  2. for each category, computer AP = area under precision vs Recall Curve

    复制代码
     1.	for each detection (high -> low)
     	1.	If it matches some GT(Ground-Truth) box with IoU>0.5 mark it as positive and eliminate the GT
     	2.	otherwise mark is as nagative
     	3.	plot a point on PR curve
     2.	AP = area under PR Curve
  3. mAP = average of AP for each category

  4. COCO mAP: compute mAP for each IoU threshold and take average

How to get AP = 1.0 -> hit all GT boxes with IoU > 0.5, no false positive ranked above any true positive

Fast R-CNN

  1. ConvNet (Backbone network)-> convolutional features for entire high resolution image
  2. Regions of Interest (Rols)
  3. Crop + Resize features
  4. Per-Region Network (light-weight -> fast)
  5. output category and box

Cropping Features: Rol Pool

  1. project proposal onto features
  2. snap to gird cells
  3. divide into 2*2 gird of (roughly) equal subregions
  4. max-pool within each subregions
  5. output the region features (always the same size even if we have different sizes of input regions)

Rol Align

Rol Align -> better align to avoid snapping

Faster R-CNN

Insert Region Proposal Network (RPN) to predict proposals from features

after the backbone network -> RPN -> regional proposals

Imagine an anchor box of fixed size at each point in the feature map

At each point predict whether the corresponding anchor contains an object

for positive boxes, also predict a box transform to regress from anchor box to object box

Use k different anchor boxes at each point

Single stage Faster R-CNN

just use anchor to make classification and object boxes predictions

Semantic Segmentation: Fully Convolutional Network

Input -> Convolutions -> Scores C * H * W -> argmax H * W

use cross-entropy loss of every pixel to train the network

Trick: Downsampling and Upsampling

Downsampling : Pooling, strided convolution

Upsampling

Unpooling

Bed of nails : fill 0

Nearest Neighbour: same numbers in small blocks

Bilinear Interpolation

f x , y = ∑ i , j f i , j max ⁡ ( 0 , 1 − ∣ x − i ∣ ) max ⁡ ( 0 , 1 − ∣ y − j ∣ ) f_{x,y} = \sum_{i,j}{f_{i,j} \max(0, 1-|x-i|) \max(0,1-|y-j|)} fx,y=∑i,jfi,jmax(0,1−∣x−i∣)max(0,1−∣y−j∣)

i,j in Nearest neighbours

Use two closest neighbours in x and y to construct linear approximations

Bicubic Interpolation

three closest neighbours in x and y to construct cubic approximation

Max Unpooling
Learnable Upsampling

Mask R-CNN

Just add Conv layers to predict a mask for each of C classes on the region proposals

Panoptic Segmentation

speperate different objects in the same category

Human Keypoints

Represent the pose of a human by locating a set of keypoints

Joint Instance Segmentation and Pose Estimation

-> General Idea: Add Per-Region "Heads" to Faster / Mask R-CNN

Dense captioning -> nlp -> visual reasoning

3D shape prediction ...

相关推荐
m0_641889292 分钟前
GEO优化监测:品牌如何靠GEO挖掘可靠信源,提升AI搜索曝光获客
人工智能·geo·数字营销·ai搜索·智能营销·geo优化·geo平台
一次旅行2 分钟前
AI 技术热点新闻简报|2026-05-30
大数据·人工智能
aqi004 分钟前
15天学会AI应用开发(三)把历史对话作为提示词会怎样
人工智能·python·大模型·ai编程·ai应用
俯首甘为孺子刘x4 分钟前
AI时代的焦虑与思考
人工智能·ai编程·codex·ai-agent
北京耐用通信6 分钟前
耐达讯自动化PROFIBUS光纤模块:工业通信的“光电翻译官”
人工智能·科技·网络协议·自动化·信息与通信
青风977 分钟前
YOLO-World:实时开放词汇对象检测(YOLO-World: Real-Time Open-Vocabulary Object Detection)
人工智能·yolo·目标检测
日月新著8 分钟前
本地部署AI Agent实现GEO自动化效果追踪的技术方案
人工智能·自动化
蒟蒻的贤9 分钟前
深度学习底层核心原理:损失函数、梯度与参数更新
人工智能·深度学习
程序猿阿伟10 分钟前
《OpenClaw行为审计与追溯系统设计》
人工智能