Summary of Fast R-CNN Paper

Paper

Key Contributions

Achievements

Fast R-CNN

[Figure: Fast R-CNN architecture]

The above figure illustrates the Fast R-CNN architecture. As in R-CNN, category-independent region (object) proposals are first extracted from the image. The Fast R-CNN network then takes the entire input image and the set of object proposals as input. The network first processes the whole image with several convolutional and max-pooling layers to produce a conv feature map. For each object proposal, a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map. Each feature vector is then fed into a sequence of fully connected (fc) layers.
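The RoI pooling step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function name `roi_pool`, the feature-map coordinate convention for the RoI, and the 7×7 output grid (typical for VGG-style backbones) are assumptions.

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=(7, 7)):
    """Max-pool one RoI on a conv feature map into a fixed-size grid,
    producing a fixed-length feature vector regardless of RoI size.
    feature_map: (C, H, W) array; roi: (x1, y1, x2, y2) in feature-map coords."""
    C, H, W = feature_map.shape
    x1, y1, x2, y2 = roi
    oh, ow = output_size
    out = np.zeros((C, oh, ow), dtype=feature_map.dtype)
    # Split the RoI into an oh x ow grid of roughly equal sub-windows.
    ys = np.linspace(y1, y2 + 1, oh + 1).astype(int)
    xs = np.linspace(x1, x2 + 1, ow + 1).astype(int)
    for i in range(oh):
        for j in range(ow):
            # Guarantee each sub-window is at least 1x1.
            y_lo, y_hi = ys[i], max(ys[i + 1], ys[i] + 1)
            x_lo, x_hi = xs[j], max(xs[j + 1], xs[j] + 1)
            out[:, i, j] = feature_map[:, y_lo:y_hi, x_lo:x_hi].max(axis=(1, 2))
    return out
```

Because every RoI, whatever its size, is pooled to the same grid, the flattened output can feed fixed-size fc layers directly.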

While training Fast R-CNN, the authors initialize the architecture with pre-trained networks. The initialized pre-trained networks undergo the following transformations.

Large fully connected layers can be made faster using truncated SVD. In object detection, nearly half of the forward-pass time is spent computing the fully connected layers, so truncated SVD is used to speed up the network.
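The speed-up works by factoring an fc layer's weight matrix W (m×n) with its top-t singular values, W ≈ U_t Σ_t V_tᵀ, and replacing the single large layer with two smaller ones, cutting the parameter count from mn to t(m+n). A minimal NumPy sketch (the helper name `truncate_fc` is hypothetical):

```python
import numpy as np

def truncate_fc(W, t):
    """Approximate fc weights W (m x n) as W ≈ W2 @ W1 using the top-t
    singular values. W1 is (t x n), W2 is (m x t); the layer's original
    bias stays on the second (smaller) layer unchanged."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    W1 = np.diag(s[:t]) @ Vt[:t]   # first new layer: Sigma_t V_t^T, no bias
    W2 = U[:, :t]                  # second new layer: U_t
    return W1, W2
```

For example, with m = 4096 outputs and n = 4096 inputs, keeping t = 256 singular values reduces ~16.8M multiply-adds to ~2.1M per input vector.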

Experiments

The experiments use the following ImageNet models

In the experiments, the authors compared Fast R-CNN against the other top methods available at the time, testing on three datasets: VOC 2007, VOC 2010, and VOC 2012. All Fast R-CNN experiments use single-scale training and testing.

In contrast, SPPnet uses five scales during both training and testing. The improvement of Fast R-CNN over SPPnet shows that, even though Fast R-CNN uses single-scale training and testing, fine-tuning the conv layers provides a large improvement in mAP (mean average precision, %).
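As a rough illustration of the mAP metric, the per-class average precision can be computed as the area under the precision-recall curve; mAP is then the mean of AP over classes. This is a minimal sketch that assumes detections have already been matched to ground truth at an IoU threshold (yielding binary correctness labels); the name `average_precision` is hypothetical, and the actual VOC protocol uses interpolated precision in some years.

```python
import numpy as np

def average_precision(scores, labels):
    """Area under the precision-recall curve for one class.
    scores: detection confidences; labels: 1 if the detection matched
    a ground-truth box (true positive), 0 otherwise."""
    order = np.argsort(-np.asarray(scores))          # rank by confidence
    tp = np.asarray(labels, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    precision = cum_tp / (np.arange(len(tp)) + 1)
    recall = cum_tp / tp.sum()
    # Accumulate precision over each increase in recall.
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):
        if r > prev_r:
            ap += p * (r - prev_r)
            prev_r = r
    return ap
```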

[Table: runtime comparison of Fast R-CNN, R-CNN, and SPPnet]

The above table shows a runtime comparison between the same models in Fast R-CNN, R-CNN, and SPPnet. Fast R-CNN and R-CNN use single-scale mode; SPPnet uses multi-scale training with five scales.

Conclusion

The following main results from the experiments support the paper's contributions.