Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet

Wieland Brendel, Matthias Bethge

Intro

Image classification here relies on counts of local features, not on the spatial relationships between them.

Network Architecture

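A bag-of-features representation closely resembles a bag-of-words representation. Bag-of-words counts how often each word occurs in a document, where the words are drawn from a fixed vocabulary. The per-word counts are then collected into one long vector.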
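As a minimal sketch of the counting step described above: the vocabulary, the sample sentence, and the function name `bag_of_words` are all made up for illustration, and tokenization is a plain whitespace split.

```python
from collections import Counter


def bag_of_words(document, vocabulary):
    """Return the count of each vocabulary word in the document, in vocabulary order."""
    # Lowercase and split on whitespace; a real tokenizer would do more.
    counts = Counter(document.lower().split())
    # Words outside the vocabulary are simply ignored.
    return [counts[word] for word in vocabulary]


# Hypothetical toy vocabulary and document.
vocabulary = ["cat", "dog", "sat", "mat"]
vector = bag_of_words("The cat sat on the mat the cat", vocabulary)
print(vector)
```

Word order plays no role here: shuffling the words of the document leaves the vector unchanged.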

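Similarly, bag-of-features relies on a vocabulary of visual words, where each word stands for a cluster of local image features. Each image is then described by a term vector holding the count of each visual word. This term vector can be fed to a relatively simple model such as an MLP or an SVM.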
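The term vector can be sketched as a histogram over a codebook. This assumes the classic setup in which each local feature is assigned to its nearest visual word; the notes do not say how the assignment is done, and the 2-dimensional codebook and features below are invented toy values.

```python
def nearest_word(feature, codebook):
    """Index of the codebook entry closest to feature (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(feature, codebook[i]))


def bag_of_features(local_features, codebook):
    """Count how many local features fall on each visual word."""
    histogram = [0] * len(codebook)
    for feature in local_features:
        histogram[nearest_word(feature, codebook)] += 1
    return histogram


# Hypothetical codebook of three visual words (e.g. cluster centres).
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
# Hypothetical local features extracted from one image.
local_features = [[0.1, 0.0], [0.9, 1.1], [0.2, 0.9], [1.0, 0.8]]
histogram = bag_of_features(local_features, codebook)
print(histogram)
```

The resulting histogram has one entry per visual word and discards where in the image each feature was found.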

The most obvious advantage of BoF is that it is easy to interpret when the final classifier is a linear model.
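One way to see why a linear model is easy to interpret: its score is a sum with one term per visual word, so each word's contribution can be read off directly. The sketch below uses invented counts and weights, and `explain_linear_score` is a hypothetical helper name.

```python
def explain_linear_score(counts, weights, bias=0.0):
    """Return the linear score and the per-word terms that add up to it."""
    contributions = [w * c for w, c in zip(weights, counts)]
    return sum(contributions) + bias, contributions


counts = [1, 2, 1]             # hypothetical visual-word counts for one image
weights = [0.5, -1.0, 2.0]     # hypothetical weights of one class
score, contributions = explain_linear_score(counts, weights)
print(score, contributions)
```

A positive entry in `contributions` is evidence for the class and a negative entry is evidence against it.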

The model is constructed as follows:

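First, a few stacked ResNet blocks map each image patch ($$q \times q$$) to a 2048-dimensional feature representation. A linear classifier on top then predicts the class evidence for that patch. The class evidence of all patches is averaged to estimate the class of the whole image. The architecture differs from a standard ResNet in that many of the $$3 \times 3$$ convolutions are replaced by $$1 \times 1$$ convolutions. No explicit visual words are defined. Note the distinctive point: a linear classifier is applied on top of every local feature representation.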
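The classification head described above can be sketched in plain Python. This is only an illustration under stated assumptions: the ResNet feature extractor is skipped, the patch features are invented 3-dimensional vectors standing in for the 2048-dimensional ones, the same weights are assumed to be shared across patches, and all names and numbers are hypothetical.

```python
def linear_logits(feature, weights, biases):
    """Class evidence for one patch: one dot product per class."""
    return [sum(w * x for w, x in zip(row, feature)) + b
            for row, b in zip(weights, biases)]


def bagnet_style_predict(patch_features, weights, biases):
    """Average per-patch class evidence and pick the strongest class."""
    per_patch = [linear_logits(f, weights, biases) for f in patch_features]
    n = len(per_patch)
    # Average over patches, separately for each class.
    image_logits = [sum(column) / n for column in zip(*per_patch)]
    predicted = max(range(len(image_logits)), key=image_logits.__getitem__)
    return predicted, image_logits, per_patch


# Hypothetical features for four patches (3-dim instead of 2048-dim).
patch_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Hypothetical linear classifier with two classes.
weights = [[2.0, 0.0, -1.0], [0.0, 1.0, 0.5]]
biases = [0.0, 0.0]

predicted, image_logits, per_patch = bagnet_style_predict(patch_features, weights, biases)
print(predicted, image_logits)
```

Because the classifier is linear, averaging the per-patch evidence gives the same result as applying the classifier to the averaged patch features. The `per_patch` list keeps the evidence of each patch, which is what makes it possible to see which patches pushed the decision toward a class.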

Appendix

I have not finished reading this part, but the key point is that the visual words here still need a DNN (ResNet) to construct them.
