ECCV 2014 - LNCS 8689-8695

Linking People in Videos with “Their” Names Using Coreference Resolution*

Vignesh Ramanathan¹, Armand Joulin², Percy Liang², and Li Fei-Fei²

¹Department of Electrical Engineering, Stanford University, USA
vigneshr@cs.stanford.edu

²Computer Science Department, Stanford University, USA
ajoulin@cs.stanford.edu
pliang@cs.stanford.edu
feifeili@cs.stanford.edu

Abstract. Natural language descriptions of videos provide a potentially rich and vast source of supervision. However, the highly varied nature of language presents a major barrier to its effective use. What is needed are models that can reason under uncertainty about both videos and text. In this paper, we tackle the core task of person naming: assigning names of people in the cast to human tracks in TV videos. Screenplay scripts accompanying the video provide some crude supervision about who is in the video. However, even the basic problem of knowing who is mentioned in the script is often difficult, since language frequently refers to people using pronouns (e.g., “he”) and nominals (e.g., “man”) rather than actual names (e.g., “Susan”). Resolving the identity of these mentions is the task of coreference resolution, an active area of research in natural language processing. We develop a joint model for person naming and coreference resolution and, in the process, infer a latent alignment between tracks and mentions. We evaluate our model on both vision and NLP tasks on a new dataset of 19 TV episodes. On both tasks, we significantly outperform the independent baselines.

Keywords: Person naming, coreference resolution, text-video alignment

Electronic Supplementary Material (PDF, 241 KB)

LNCS 8689, p. 95 ff.
