
Video Action Detection with Relational Dynamic-Poselets

Limin Wang1, 2, Yu Qiao2, and Xiaoou Tang1, 2

1Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong
07wanglimin@gmail.com
xtang@ie.cuhk.edu.hk

2Shenzhen Key Lab of CVPR, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
yu.qiao@siat.ac.cn

Abstract. Action detection is of great importance in understanding human motion from video. Compared with action recognition, it not only recognizes the action type but also localizes its spatiotemporal extent. This paper presents a relational model for action detection, which first decomposes a human action into temporal “key poses” and then further into spatial “action parts”. Specifically, we start by clustering cuboids around each human joint into dynamic-poselets using a new descriptor. Cuboids from the same cluster share consistent geometric and dynamic structure, and each cluster serves as one mixture component of a body part. We then propose a sequential skeleton model to capture the relations among dynamic-poselets. This model unifies, in a single framework, the tasks of learning the composites of mixture dynamic-poselets, the spatiotemporal structures of action parts, and the local model for each action part. Our model not only localizes the action in a video stream but also enables detailed pose estimation of the actor. We formulate model learning in a structured SVM framework and speed up inference with dynamic programming. We conduct experiments on three challenging action detection datasets: the MSR-II dataset, the UCF Sports dataset, and the JHMDB dataset. The results show that our method achieves superior performance to state-of-the-art methods on these datasets.
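To make the two stages named in the abstract concrete, the Python sketch below illustrates (1) clustering joint-centered cuboid descriptors into dynamic-poselets with k-means, and (2) max-sum dynamic-programming inference over a chain of key poses, standing in for inference in the sequential skeleton model. This is a minimal sketch under assumed interfaces, not the authors' implementation: the descriptor layout, the use of k-means, the chain topology, and the function names (cluster_dynamic_poselets, chain_dp_inference) are illustrative assumptions.

    # Illustrative sketch only; not the paper's code. Assumes each
    # joint-centered cuboid is already summarized as a fixed-length
    # descriptor vector.
    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_dynamic_poselets(descriptors, n_clusters=50, seed=0):
        """Group cuboid descriptors (one row per joint-centered cuboid)
        so that each cluster acts as one mixture component of a body part."""
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
        labels = km.fit_predict(descriptors)
        return labels, km.cluster_centers_

    def chain_dp_inference(unary, pairwise):
        """Max-sum dynamic programming over a chain of T key poses.

        unary:    (T, S) array; unary[t, s] is the local appearance score of
                  placing key pose t in candidate state s (a location plus a
                  mixture choice).
        pairwise: (T-1, S, S) array; pairwise[t, s, s'] is the spatiotemporal
                  compatibility of consecutive key poses in states s and s'.
        Returns the best total score and the best state sequence.
        """
        T, S = unary.shape
        score = unary[0].copy()
        backptr = np.zeros((T, S), dtype=int)
        for t in range(1, T):
            # For each state of pose t, pick the best predecessor state.
            cand = score[:, None] + pairwise[t - 1]        # (S, S)
            backptr[t] = np.argmax(cand, axis=0)
            score = cand[backptr[t], np.arange(S)] + unary[t]
        # Backtrack the highest-scoring sequence of states.
        states = [int(np.argmax(score))]
        for t in range(T - 1, 0, -1):
            states.append(int(backptr[t, states[-1]]))
        return float(score.max()), states[::-1]

The max-sum recursion is what makes chain inference linear in the number of key poses; the full model additionally searches over spatial locations and mixture types per pose, which enlarges the state space S but leaves the recursion unchanged.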

Keywords: Action detection, dynamic-poselet, sequential skeleton model

LNCS 8693, p. 565 ff.



© Springer International Publishing Switzerland 2014