R-CNN

Papers

Because I am not going to implement them… so I just write some notes here…

History: R-CNN -> Fast R-CNN -> Faster R-CNN -> Mask R-CNN

The first paper using CNN: image classification -> object detection

4 steps:

Region proposals (images -> selective search -> top 2000 proposals)
Feature extraction of each propsals (proposals -> AlexNet -> 4096-dim feature vector)
Object category classifiers (feature vector -> k SVM -> k-dim vector -> overlap filter)
Bounding-box regression (linear function of feature vector of P)

Proposal overlap filter:

a greedy non-maximum suppression (for each class independently) that rejects a region if it has an intersection-overunion (IoU) overlap with a higher scoring selected region larger than a learned threshold.

Trainning:

Improvements:

4 steps:

Region proposals (images -> selective search -> top 2000 proposals)
Feature extraction of each propsals (proposals -> VGG + RoI pooling layer -> $H \times W$ feature map)
Object category classifiers (feature map -> FC + (k + background) softmax -> (k+1)-dim vector -> overlap filter)
Bounding-box regression (feature map -> FC -> bbox)

Improvments:

5 steps:

Feature map (images -> VGG -> feature map)
Region proposals (feature map -> Region Proposal Networks -> $(W \times H \times k)$ proposals with scores -> top 2000 + overlap filter)
Feature extraction of each propsals (feature map + proposals -> RoI pooling layer -> $(H \times W)$ feature map)
Object category classifiers (feature map -> FC + (k + background) softmax -> (k+1)-dim vector)
Bounding-box regression (feature map -> FC -> bbox)

RPN:

input: $W \times H$ feature map
k anchor boxes
each boxes has two scores: object or non-object.
output: $W \times H \times(4k)$ coordinates, $W \times H \times (2k)$ scores Training:
Alternating training

Improvments:

backbone: ResNeXt-101+FPN
RoIAlign: use bilinear interpolation to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and aggregate the result (using max or average)
Mask branch: FCN (Fully Convolutional Networks for Semantic Segmentation)

FCN:

6 steps:

Feature map (images -> backbone -> feature map)
Region proposals (feature map -> Region Proposal Networks -> $(W \times H \times k)$ proposals with scores -> top 2000 + overlap filter)
Feature extraction of each propsals (feature map + proposals -> RoIAlign -> $(H \times W)$ feature map)
Object category classifiers (feature map -> FC + (k + background) softmax -> (k+1)-dim vector)
Bounding-box regression (feature map -> FC -> bbox)
Mask branch (feature map -> head -> FCN -> k mask proposals)