The bag-of-visual-words (BOW) is certainly the most popular image representation to date, and it has been shown to yield good results in various problems including Fine-Grained Visual Categorization (FGVC) [3, 4]. Our contribution is to show that the Fisher Vector (FV), which describes an image by its deviation from an average model, is an alternative that performs much better than the BOW for the FGVC problem. In this extended abstract we first provide a brief introduction to the FV. We then present theoretical as well as practical motivations for using the FV for FGVC. We finally provide experimental results on four ImageNet subsets: fungus, ungulate, vehicle and ImageNet10K.
Compared to prior work that uses spatial pyramid (SP) BOW representations, we report significantly higher classification accuracies. For instance, on ImageNet10K we report 16.7% vs. 6.4% top-1 accuracy, which represents a 160% relative improvement.
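
The relative improvement quoted for ImageNet10K can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper name `relative_improvement` and the variable names are assumptions introduced here, and the only inputs are the two top-1 accuracies stated above.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the relative gain of `improved` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (improved - baseline) / baseline


# Top-1 accuracies on ImageNet10K as quoted in the text (in percent).
sp_bow_top1 = 6.4   # spatial pyramid BOW baseline
fv_top1 = 16.7      # Fisher Vector

# (16.7 - 6.4) / 6.4, expressed as a percentage.
gain = relative_improvement(sp_bow_top1, fv_top1)
print(f"relative improvement: {gain:.1f}%")
```

The computed value is slightly under 161%, consistent with the rounded figure of 160% reported above.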