A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation.
A single network model handles both bounding-box prediction and classification.
Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors.
Very fast, real-time: the base model runs at 45 frames per second, and the smaller model reaches 155 frames per second.
Introduction
Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.
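In the YOLO system, a single look at the image is enough to predict both the coordinates and the classes; everything happens in one step.
Three main advantages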
First, YOLO is extremely fast. Since we frame detection as a regression problem we don't need a complex pipeline.
Very fast.
Second, YOLO reasons globally about the image when making predictions.
It reasons over the whole image.
Third, YOLO learns generalizable representations of objects.
Strong generalization ability.
Unified detection
This means our network reasons globally about the full image and all the objects in the image. The YOLO design enables end-to-end training and realtime speeds while maintaining high average precision.
Our system divides the input image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
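The image to be predicted is divided into S × S grid cells, and the cell containing an object's center is responsible for that object.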
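The cell assignment can be sketched in a few lines. This is only an illustration, not the paper's code: the function name `responsible_cell`, the pixel-coordinate inputs, and the default S = 7 (the value the YOLO paper uses on PASCAL VOC, not stated in these notes) are assumptions.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col) of the grid cell containing the object center,
    plus the center offsets relative to that cell, each in [0, 1]."""
    # Scale the center into grid units, so each cell has width and height 1.
    gx = cx * S / img_w
    gy = cy * S / img_h
    # Clamp so a center lying exactly on the right/bottom edge stays in range.
    col = min(int(gx), S - 1)
    row = min(int(gy), S - 1)
    return row, col, gx - col, gy - row


# Hypothetical object centered at (224, 100) in a 448 x 448 image.
row, col, x_off, y_off = responsible_cell(224, 100, 448, 448)
print(row, col, x_off, y_off)
```

The offsets returned here correspond to the (x, y) box coordinates, which are expressed relative to the bounds of the grid cell.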
Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the box is that it predicts. Formally we define confidence as Pr(Object) ∗ IOU^{truth}_{pred}. If no object exists in that cell, the confidence scores should be zero. Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.
Each bounding box consists of 5 predictions: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell.
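Each bbox contains x, y, w, h, and a confidence value. The confidence target is the IOU between the predicted box and the ground truth, or zero when the cell contains no object.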
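A minimal sketch of the IOU computation behind the confidence target. For simplicity it assumes boxes are given as corner coordinates (x_min, y_min, x_max, y_max) rather than the (x, y, w, h) center format the network predicts; the function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle, if there is one.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width and height are clamped at zero for non-overlapping boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def confidence_target(has_object, pred_box, truth_box):
    """Target confidence: IOU if the cell contains an object, otherwise 0."""
    return iou(pred_box, truth_box) if has_object else 0.0


# Two 2 x 2 boxes overlapping in a 1 x 1 square: IOU = 1 / (4 + 4 - 1).
print(confidence_target(True, (0, 0, 2, 2), (1, 1, 3, 3)))
```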
Each grid cell also predicts C conditional class probabilities, Pr(Class_i|Object). These probabilities are conditioned on the grid cell containing an object. We only predict one set of class probabilities per grid cell, regardless of the number of boxes B.
Each grid cell predicts only one set of class probabilities, so all B boxes in a cell share the same class prediction.
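Putting the pieces together, each cell outputs B boxes with 5 values each plus C class probabilities, so the prediction is an S × S × (B ∗ 5 + C) tensor. The small helper below just does that arithmetic; the example values S = 7, B = 2, C = 20 are the PASCAL VOC settings from the YOLO paper and do not appear in these notes.

```python
def output_shape(S, B, C):
    """Shape of the prediction tensor: S x S cells, each holding
    B boxes * 5 values (x, y, w, h, confidence) + C class probabilities."""
    per_cell = B * 5 + C
    return (S, S, per_cell)


shape = output_shape(S=7, B=2, C=20)
total = shape[0] * shape[1] * shape[2]
print(shape, total)
```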
At test time we multiply the conditional class probabilities and the individual box confidence predictions,
Pr(Class_i|Object) ∗ Pr(Object) ∗ IOU^{truth}_{pred} = Pr(Class_i) ∗ IOU^{truth}_{pred}
which gives us class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.
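The test-time multiplication can be illustrated as below. This is a sketch with made-up numbers: one grid cell with C = 3 classes and B = 2 boxes, and a hypothetical helper name `class_scores`.

```python
def class_scores(class_probs, box_confidences):
    """Return one list of C class-specific scores per box in a grid cell.

    class_probs: the cell's C conditional class probabilities.
    box_confidences: the confidence of each of the cell's B boxes.
    """
    return [[conf * p for p in class_probs] for conf in box_confidences]


# Made-up values: the second box has zero confidence, so all of its
# class-specific scores are zero as well.
scores = class_scores([0.5, 0.25, 0.25], [0.8, 0.0])
print(scores)
```

Because the class probabilities are shared by the whole cell, the boxes in a cell differ only through their confidence values.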
Detection often requires fine-grained visual information so we increase the input resolution of the network from 224 × 224 to 448 × 448.
Detection training uses an input size of 448 × 448.
Activation function
We use a linear activation function for the final layer and all other layers use the following leaky rectified linear activation:
φ(x) = x if x > 0, and 0.1x otherwise
This says that the final layer uses a linear activation function, while all other layers use Leaky ReLU.
Loss function
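A scalar sketch of the two activations. The negative slope of 0.1 is the value given in the YOLO paper; the function names are only illustrative.

```python
def leaky_relu(x, slope=0.1):
    """Leaky ReLU: pass positive inputs through, scale the rest by `slope`."""
    return x if x > 0 else slope * x


def linear(x):
    """Linear (identity) activation, as used for the final layer."""
    return x


for value in (-2.0, 0.0, 3.0):
    print(value, leaky_relu(value), linear(value))
```

Unlike a plain ReLU, the leaky version keeps a small nonzero gradient for negative inputs.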
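λ_{coord} = 5 increases the weight of the coordinate loss for boxes responsible for detecting an object; λ_{noobj} = 0.5 decreases the weight of the confidence loss for boxes that are not responsible for any object.
S^2 is the number of grid cells, and B is the number of predicted boxes per cell.
1^{obj}_{ij} indicates whether the j-th predicted box in the i-th grid cell is responsible for an object: 1 if it is, 0 otherwise.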
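For reference, the loss that these symbols belong to can be written out as follows. This is a transcription of the sum-squared-error loss from the YOLO paper, so it is worth checking against the original. Symbols not defined in the notes above: C_i is the box confidence, p_i(c) is the class probability for class c, hatted symbols are the corresponding predictions, 1^{noobj}_{ij} is the complement of 1^{obj}_{ij}, and 1^{obj}_{i} is 1 when an object appears in cell i.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1^{obj}_{ij}
    \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1^{obj}_{ij}
    \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
         + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1^{obj}_{ij} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1^{noobj}_{ij}
    \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} 1^{obj}_{i} \sum_{c \in classes}
    \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

The first two terms are the localization loss (weighted by λ_{coord}), the next two are the confidence loss for boxes with and without objects (the latter weighted by λ_{noobj}), and the last is the classification loss.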
This spatial constraint limits the number of nearby objects that our model can predict. Our model struggles with small objects that appear in groups, such as flocks of birds.
The number of objects that can be predicted is fairly small.
Small objects are hard to detect.
The feature representation is relatively coarse.