LNCS Homepage
ContentsAuthor IndexSearch

Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

Kaiming He1, Xiangyu Zhang2, Shaoqing Ren3, and Jian Sun1

1Microsoft Research, China

2Xi’an Jiaotong University, China

3University of Science and Technology, China

Abstract. Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g. 224×224) input image. This requirement is “artificial” and may hurt the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with a more principled pooling strategy, “spatial pyramid pooling”, to eliminate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale. By removing the fixed-size limitation, we can improve all CNN-based image classification methods in general. Our SPP-net achieves state-of-the-art accuracy on the datasets of ImageNet 2012, Pascal VOC 2007, and Caltech101.

The power of SPP-net is more significant in object detection. Using SPP-net, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors. This method avoids repeatedly computing the convolutional features. In processing test images, our method computes convolutional features 30-170× faster than the recent leading method R-CNN (and 24-64× faster overall), while achieving better or comparable accuracy on Pascal VOC 2007.

LNCS 8691, p. 346 ff.

Full article in PDF | BibTeX


lncs@springer.com
© Springer International Publishing Switzerland 2014