ECCV 2014 - LNCS 8689-8695

Action Recognition with Stacked Fisher Vectors

Xiaojiang Peng¹,²,³, Changqing Zou²,³, Yu Qiao²,⁴, and Qiang Peng¹

¹Southwest Jiaotong University, Chengdu, China

²Shenzhen Key Lab of CVPR, Shenzhen Institutes of Advanced Technology, CAS, China

³Department of Computer Science, Hengyang Normal University, Hengyang, China

⁴The Chinese University of Hong Kong, China

Abstract. Video representation is a vital problem in action recognition. This paper proposes Stacked Fisher Vectors (SFV), a new representation with multi-layer nested Fisher vector encoding, for action recognition. In the first layer, we densely sample large subvolumes from input videos, extract local features, and encode them using Fisher vectors (FVs). The second layer compresses the FVs of the subvolumes obtained in the previous layer, and then encodes them again with Fisher vectors. Compared with the standard FV, SFV refines the representation and abstracts semantic information in a hierarchical way. Compared with recent mid-level action representations, SFV does not need to mine discriminative action parts, yet it preserves mid-level information through Fisher vector encoding in the higher layer. We evaluate the proposed method on three challenging datasets, namely YouTube, J-HMDB, and HMDB51. Experimental results demonstrate the effectiveness of SFV, and the combination of the traditional FV and SFV outperforms state-of-the-art methods on these datasets by a large margin.

Keywords: Action recognition, Fisher vectors, stacked Fisher vectors, max-margin dimensionality reduction
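
The two-layer encoding described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation and rests on several assumptions: the helper names (`fisher_vector`, `toy_gmm`, `first_layer`) are hypothetical, the GMM is a crude stand-in (randomly chosen means, shared global variance) rather than one trained with EM, a random projection replaces the max-margin dimensionality reduction named in the keywords, random data replaces real local video features, and power plus L2 normalization is assumed as common Fisher vector practice.

```python
import numpy as np

def fisher_vector(X, weights, means, variances):
    """Fisher vector of descriptors X (N x D) under a diagonal GMM with K components.

    Concatenates gradients w.r.t. means and variances (2*K*D values),
    then applies signed square-root (power) and L2 normalization.
    """
    N = X.shape[0]
    # Whitened differences to every component: N x K x D
    diff = (X[:, None, :] - means[None, :, :]) / np.sqrt(variances)[None, :, :]
    # Log of the weighted Gaussian densities: N x K
    log_p = (np.log(weights)[None, :]
             - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)[None, :]
             - 0.5 * np.sum(diff ** 2, axis=2))
    log_p -= log_p.max(axis=1, keepdims=True)      # numerical stability
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)      # soft assignments (posteriors)
    g_mu = np.einsum('nk,nkd->kd', gamma, diff) / (N * np.sqrt(weights)[:, None])
    g_var = (np.einsum('nk,nkd->kd', gamma, diff ** 2 - 1)
             / (N * np.sqrt(2 * weights)[:, None]))
    fv = np.concatenate([g_mu.ravel(), g_var.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))         # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)       # L2 normalization

def toy_gmm(X, K, rng):
    """ASSUMPTION: crude stand-in for an EM-trained GMM (random means, global variance)."""
    means = X[rng.choice(len(X), size=K, replace=False)]
    variances = np.tile(X.var(axis=0) + 1e-6, (K, 1))
    weights = np.full(K, 1.0 / K)
    return weights, means, variances

def first_layer(subvolumes, gmm1, proj):
    """Encode each subvolume with an FV, then compress it.

    ASSUMPTION: `proj` is a random projection standing in for the learned
    max-margin dimensionality reduction.
    """
    local_fvs = np.stack([fisher_vector(S, *gmm1) for S in subvolumes])
    return local_fvs @ proj

rng = np.random.default_rng(0)
D, K1, d, K2 = 8, 4, 6, 3   # descriptor dim, layer-1 GMM size, compressed dim, layer-2 GMM size

# Synthetic "video": 20 subvolumes, each with 50 local descriptors of dimension D
subvolumes = [rng.normal(size=(50, D)) for _ in range(20)]

gmm1 = toy_gmm(np.vstack(subvolumes), K1, rng)
proj = rng.normal(size=(2 * K1 * D, d)) / np.sqrt(2 * K1 * D)

compressed = first_layer(subvolumes, gmm1, proj)   # 20 x d, one row per subvolume
gmm2 = toy_gmm(compressed, K2, rng)
sfv = fisher_vector(compressed, *gmm2)             # second-layer (stacked) FV

print(sfv.shape)  # (36,) = 2 * K2 * d
```

In this sketch the final video-level vector has length 2 · K2 · d, so the compressed dimension d directly controls the size of the stacked representation; that is one reason a compression step between the layers is needed at all.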

LNCS 8693, p. 581 ff.
