
Semantic Aware Video Transcription Using Random Forest Classifiers

Chen Sun and Ram Nevatia

University of Southern California, Institute for Robotics and Intelligent Systems, Los Angeles, CA 90089, USA

Abstract. This paper focuses on transcription generation in the form of subject, verb, object (SVO) triplets for videos in the wild, given off-the-shelf visual concept detectors. The problem is challenging because only sentence-level annotations are available, the concept detectors are unreliable, and many words lack training samples. To address these challenges, we propose a Semantic Aware Transcription (SAT) framework based on Random Forest classifiers. It takes concept detection results as input and outputs a distribution over English words. SAT is trained on video-sentence pairs. It hierarchically learns node splits by grouping semantically similar words, with similarity measured by a continuous skip-gram language model. This not only addresses the sparsity of training samples per word, but also yields semantically reasonable errors during transcription. SAT provides a systematic way to measure the relatedness of a concept detector to real words, which helps us understand the relationship between current visual detectors and words in a semantic space. Experiments on a large video dataset with 1,970 clips and 85,550 sentences demonstrate the effectiveness of our approach.
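The node-splitting idea can be made concrete with a small sketch. The following Python snippet is an illustrative approximation under stated assumptions, not the authors' implementation: it assumes each training video is represented by a vector of concept detector scores plus one target word drawn from its sentence annotation, and that a dictionary word_vectors maps words to pre-trained skip-gram embeddings (e.g., word2vec vectors as NumPy arrays). All function names (semantic_coherence, split_score, grow_node) are hypothetical.

```python
# Hypothetical sketch of a semantic-aware node split: a random-forest node
# thresholds one concept detector's score so that the words falling into
# each child are semantically similar under skip-gram embeddings.
import numpy as np

def semantic_coherence(words, word_vectors):
    """Mean pairwise cosine similarity of the skip-gram vectors of `words`.
    Higher values mean the words grouped in this node are semantically closer."""
    if len(words) < 2:
        return 1.0
    V = np.stack([word_vectors[w] for w in words])
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    S = V @ V.T
    n = len(words)
    return (S.sum() - n) / (n * (n - 1))  # average off-diagonal similarity

def split_score(feature_scores, words, threshold, word_vectors):
    """Rate a candidate split (one detector's scores vs. a threshold) by the
    size-weighted semantic coherence of the two resulting word groups."""
    left = [w for s, w in zip(feature_scores, words) if s <= threshold]
    right = [w for s, w in zip(feature_scores, words) if s > threshold]
    if not left or not right:
        return -np.inf
    n = len(words)
    return (len(left) / n) * semantic_coherence(left, word_vectors) + \
           (len(right) / n) * semantic_coherence(right, word_vectors)

def grow_node(X, words, word_vectors, n_trials=64, rng=None):
    """Try random (detector, threshold) pairs and keep the split whose children
    group semantically similar words. X holds concept detection scores, one row
    per video; `words` holds one target word (e.g., the verb) per video."""
    rng = rng or np.random.default_rng(0)
    best = (None, None, -np.inf)
    for _ in range(n_trials):
        f = int(rng.integers(X.shape[1]))
        t = float(rng.uniform(X[:, f].min(), X[:, f].max()))
        s = split_score(X[:, f], words, t, word_vectors)
        if s > best[2]:
            best = (f, t, s)
    return best  # (detector index, threshold, coherence score)
```

In a full forest, recursively applying such splits would leave each leaf with an empirical distribution over words, which corresponds to the word distribution SAT outputs at transcription time.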

Keywords: Video transcription, random forest, skip-gram language model

LNCS 8689, p. 772 ff.



© Springer International Publishing Switzerland 2014