CVPR Video Reviews

Markerless Motion Capture with Unsynchronized Moving Cameras

Nils Hasler*, MPI Informatik

Bodo Rosenhahn, Hannover University

Thorsten Thormaehlen, MPI Informatik

Michael Wand, Saarland University

Juergen Gall, BIWI, ETH Zurich

Hans-Peter Seidel, MPI Informatik

In this work we present an approach for markerless motion capture (MoCap) of articulated objects, which are recorded with multiple unsynchronized moving cameras. Instead of using fixed (and expensive) hardware synchronized cameras, this approach allows us to track people with off-the-shelf handheld video cameras.

To prepare a sequence for motion capture, we first reconstruct the static background and the position of each camera using Structure-from-Motion (SfM). Then the cameras are registered to each other using the reconstructed static background geometry.

Camera synchronization is achieved via the audio streams recorded by the cameras in parallel. Finally, a markerless MoCap approach is applied to recover positions and joint configurations of subjects. Feature tracks and dense background geometry are further used to stabilize the MoCap. The experiments show examples with highly challenging indoor and outdoor scenes.
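
For readers who want to try the audio-based synchronization idea, the following is a minimal sketch (not the authors' implementation): the temporal offset between two cameras is taken as the lag that maximizes the cross-correlation of their audio tracks. The file names and single-channel WAV assumption are hypothetical.

```python
# Minimal sketch of audio-based camera synchronization (not the authors' code):
# estimate the temporal offset between two cameras as the lag that maximizes
# the cross-correlation of their audio tracks.
import numpy as np
from scipy.io import wavfile
from scipy.signal import correlate

def estimate_offset_seconds(wav_a, wav_b):
    rate_a, audio_a = wavfile.read(wav_a)
    rate_b, audio_b = wavfile.read(wav_b)
    assert rate_a == rate_b, "resample both tracks to a common rate first"
    # Use a single channel and zero-mean signals.
    a = np.asarray(audio_a, dtype=np.float64)
    b = np.asarray(audio_b, dtype=np.float64)
    if a.ndim > 1: a = a[:, 0]
    if b.ndim > 1: b = b[:, 0]
    a -= a.mean(); b -= b.mean()
    corr = correlate(a, b, mode="full")
    lag = np.argmax(corr) - (len(b) - 1)   # best-aligning lag in samples
    return lag / float(rate_a)             # offset in seconds between the tracks

# Hypothetical usage:
# offset = estimate_offset_seconds("cam1.wav", "cam2.wav")
```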

Predicting High Resolution Image Edges with a Generic, Adaptive, 3-D Vehicle Model

Matthew Leotta*, Brown University

This video demonstrates the algorithm for learning a generic vehicle model and using it to recover 3-D vehicle shape from real images, as described in the related CVPR 2009 paper. The following is a transcript of the video narration:

Predicting High Resolution Image Edges with a Generic, Adaptive, 3-D Vehicle Model.

CAD models are used as training data for the generic vehicle model. A deformable template mesh is fit to the CAD model body. The template is subdivided and fit again to model fine-resolution detail.

2-d parts with different material properties are modeled on the vehicle surface.  The boundaries of these parts are projected onto the template mesh, and then mapped into texture space coordinates.

A set of template parts is fit to the projected CAD model parts.  This is necessary because the sampling of projected parts varies from vehicle to vehicle.

The same template mesh and template parts are fit to many other CAD models of different vehicles.  This generic vehicle model can now deform to match the shape and appearance of a wide variety of passenger vehicles including sedans, station wagons, minivans, SUVs, and pickup trucks.  Notice that the part boundaries move freely on the vehicle surface.  They are not constrained by the topology of the 3-d mesh.  A mean vehicle is produced by computing the average vertex locations of the template when fit to 79 different vehicles.

Principal component analysis reduces the dimension of the shape space by finding the directions of maximum variation relative to the mean.
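
The PCA step can be illustrated with a short, generic sketch (not the authors' code; it assumes the fitted template meshes already share vertex correspondence): each fitted mesh is flattened into a vector of vertex coordinates, and PCA yields a mean shape plus principal modes of variation.

```python
# Generic PCA shape-space sketch: one row per vehicle, columns are flattened
# vertex coordinates of the fitted template mesh.
import numpy as np

def build_shape_space(meshes, n_modes=10):
    """meshes: array of shape (n_vehicles, n_vertices, 3)."""
    X = meshes.reshape(len(meshes), -1)
    mean = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    modes = Vt[:n_modes]                          # directions of maximum variation
    stdev = S[:n_modes] / np.sqrt(len(meshes) - 1)
    return mean, modes, stdev

def synthesize(mean, modes, coeffs):
    """Reconstruct a shape from low-dimensional PCA coefficients."""
    return (mean + coeffs @ modes).reshape(-1, 3)
```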

Occluding contours and part boundaries are used to predict the appearance of intensity edges in images of vehicles. Gradient-based optimization iteratively estimates the shape and pose parameters by minimizing the distance between predicted and detected edges in multiple calibrated views. This vehicle shape is recovered using 10 PCA shape parameters.

Motion Capture Using Joint Skeleton Tracking and Surface Estimation

Juergen Gall*, BIWI, ETH Zurich

Carsten Stoll,

Edilson De Aguiar,

Christian Theobalt,

Bodo Rosenhahn, Hannover University

Hans-Peter Seidel, MPI Informatik

This video shows results of our method for capturing the performance of a human or an animal from a multi-view video sequence, which is described in the paper "Motion Capture Using Joint Skeleton Tracking and Surface Estimation" (CVPR09). Given an articulated template model and silhouettes from a multi-view image sequence, our approach recovers not only the movement of the skeleton, but also the possibly non-rigid temporal deformation of the 3D surface.

While large scale deformations or fast movements are captured by the skeleton pose and approximate surface skinning, true small scale deformations or non-rigid garment motion are captured by fitting the surface to the silhouette.

We show on various sequences that our approach can capture the 3D motion of animals and humans accurately even in the case of rapid movements and wide apparel like skirts.

Rank Priors for Continuous Non-Linear Dimensionality Reduction

Andreas Geiger*, KIT

Raquel Urtasun, EECS Berkeley

Trevor Darrell, EECS Berkeley

Discovering the underlying low-dimensional latent structure in high-dimensional perceptual observations (e.g., images, video) can, in many cases, greatly improve performance in recognition and tracking. However, non-linear dimensionality reduction methods are often susceptible to local minima and perform poorly when initialized far from the global optimum, even when the intrinsic dimensionality is known a priori.

In this work we introduce a prior over the dimensionality of the latent space that penalizes high dimensional spaces, and simultaneously optimize both the latent space and its intrinsic dimensionality in a continuous fashion. Ad-hoc initialization schemes are unnecessary with our approach; we initialize the latent space to the observation space and automatically infer the latent dimensionality. We report results applying our prior to various probabilistic non-linear dimensionality reduction tasks, and show that our method can outperform graph-based dimensionality reduction techniques as well as previously suggested initialization strategies.
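
To make the idea of a continuous rank prior concrete, here is a deliberately simplified sketch (not the authors' probabilistic formulation): a nuclear-norm-style penalty on the latent coordinates, optimized with proximal gradient steps, shrinks unneeded latent dimensions continuously toward zero while the latent space is initialized to the observations.

```python
# Illustrative sketch only: minimize 0.5*||Y - X||^2 + lam*||Y||_* by proximal
# gradient descent. Soft-thresholding the singular values acts as a continuous
# penalty on the effective dimensionality of the latent coordinates Y.
import numpy as np

def low_rank_latent(X, lam=0.5, lr=0.1, iters=500):
    Y = X.copy()                              # initialize latent space to the observations
    for _ in range(iters):
        Y = Y - lr * (Y - X)                  # gradient step on the data-fidelity term
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        s = np.maximum(s - lr * lam, 0.0)     # soft-threshold singular values (rank prior)
        Y = (U * s) @ Vt
    effective_dim = int(np.sum(s > 1e-6))     # dimensions that survived the shrinkage
    return Y, effective_dim
```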

We demonstrate the effectiveness of our approach when tracking and classifying human motion. This video shows our method and illustrates some of our results.

Similarity Metrics and Efficient Optimization for Simultaneous Registration

Christian Wachinger*, TU München

Nassir Navab, TU München

We address the alignment of a group of images with simultaneous registration. To this end, we provide further insights into a recently introduced class of multivariate similarity measures referred to as accumulated pair-wise estimates (APE) and derive efficient optimization methods for it. More specifically, we show a strict mathematical deduction of APE from a maximum-likelihood framework and establish a connection to the congealing framework. This is only possible after an extension of the congealing framework with neighborhood information. Moreover, we address the increased computational complexity of simultaneous registration by deriving efficient gradient-based optimization strategies for APE: Gauß-Newton and efficient second-order minimization (ESM). In addition to SSD, we show how the intrinsically non-squared similarity measures NCC, CR, and MI can be used in this least-squares optimization framework. Finally, we evaluate the performance of the optimization strategies with respect to the similarity measures, obtaining very promising results for ESM.
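
For readers unfamiliar with the optimizers mentioned above, a generic Gauss-Newton update for a least-squares registration objective looks as follows (a sketch under generic assumptions, not the paper's implementation; ESM additionally blends the Jacobians computed at the current and reference parameters). The `residual_fn` and `jacobian_fn` callables are placeholders.

```python
# Generic Gauss-Newton sketch for a least-squares registration objective.
# residual_fn(p) returns the stacked per-pixel residuals r(p), shape (m,);
# jacobian_fn(p) returns dr/dp, shape (m, n).
import numpy as np

def gauss_newton(residual_fn, jacobian_fn, p0, iters=20, damping=1e-6):
    p = np.asarray(p0, dtype=float)
    for _ in range(iters):
        r = residual_fn(p)
        J = jacobian_fn(p)
        H = J.T @ J + damping * np.eye(len(p))   # Gauss-Newton Hessian approximation
        p = p - np.linalg.solve(H, J.T @ r)      # parameter update
    return p
```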

LidarBoost: Depth Superresolution for ToF 3D Shape Scanning

Sebastian Schuon*, Stanford University

Christian Theobalt, Stanford University

James Davis, UC Santa Cruz

Sebastian Thrun, Stanford University

Depth maps captured with time-of-flight cameras have very low data quality: the image resolution is rather limited and the level of random noise contained in the depth maps is very high. Therefore, such flash lidars cannot be used out of the box for high-quality 3D object scanning. To solve this problem, we present LidarBoost, a 3D depth superresolution method that combines several low resolution noisy depth images of a static scene from slightly displaced viewpoints, and merges them into a high-resolution depth image. We have developed an optimization framework that uses a data fidelity term and a geometry prior term that is tailored to the specific characteristics of flash lidars. We demonstrate both visually and quantitatively that LidarBoost produces better results than previous methods from the literature.
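
As a rough illustration of the multi-frame fusion idea (a deliberately simplified sketch, not LidarBoost itself), assume the low-resolution depth maps have already been registered and resampled onto the high-resolution grid; the fused map then minimizes a data-fidelity term plus a simple smoothness prior.

```python
# Simplified multi-frame depth fusion sketch: gradient descent on
#   0.5 * sum_k ||mask_k * (D - Y_k)||^2 + 0.5 * lam * ||grad D||^2
# where Y_k are the aligned, upsampled depth maps (NaN = missing sample).
import numpy as np

def fuse_depth(maps, lam=0.2, lr=0.1, iters=300):
    maps = np.asarray(maps, dtype=float)                       # shape (K, H, W)
    valid = ~np.isnan(maps)
    data = np.where(valid, maps, 0.0)
    D = data.sum(axis=0) / np.maximum(valid.sum(axis=0), 1)    # init: per-pixel mean
    for _ in range(iters):
        grad = (valid * (D - data)).sum(axis=0)                # data-fidelity gradient
        lap = (np.roll(D, 1, 0) + np.roll(D, -1, 0) +
               np.roll(D, 1, 1) + np.roll(D, -1, 1) - 4 * D)
        grad -= lam * lap                                      # smoothness gradient
        D -= lr * grad
    return D
```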

Detecting Carried Objects from Video Sequences

Dima Damen*, University of Leeds

David Hogg, University of Leeds

The video reviews a novel method for detecting objects such as bags carried by pedestrians depicted in short video sequences. It demonstrates the algorithm previously presented in a publication at ECCV2008, along with new results on a wider dataset. In common with earlier work on the same problem, the method starts by averaging aligned foreground regions of a walking pedestrian to produce a representation of motion and shape (known as a temporal template) that has some immunity to noise in foreground segmentations and phase of the walking cycle. Our key novelty is for carried objects to be revealed by comparing the temporal templates against view-specific exemplars generated offline for unencumbered pedestrians. A likelihood map obtained from this match is combined in a Markov random field with a map of prior probabilities for carried objects and a spatial continuity assumption, from which we obtain a segmentation of carried objects using the MAP solution. Although developed for a specific problem, the method could be applied to the detection of irregularities in appearance for other categories of object that move in a periodic fashion.
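
The temporal-template step can be summarized in a few lines (a sketch, not the authors' code); the inputs are assumed to be binary foreground masks of one pedestrian, already aligned frame to frame.

```python
# Sketch of building a temporal template: average aligned binary foreground
# masks so that persistent body/object regions get values near 1 and noisy or
# transient regions are suppressed.
import numpy as np

def temporal_template(aligned_masks):
    """aligned_masks: array of shape (T, H, W) with values in {0, 1},
    already aligned (e.g. on the pedestrian's centroid and bounding box)."""
    masks = np.asarray(aligned_masks, dtype=float)
    return masks.mean(axis=0)       # values in [0, 1]
```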

Active Tracking of Two Free-Moving Targets with a Stereo Head

Luis Perdigoto*, University of Coimbra

Joao Barreto, ISR

Rui Caseiro, ISR

Helder Araujo, ISR

Active tracking consists in controlling the degrees of freedom (DOF) of robotized cameras, such that specific scene objects are imaged in a certain manner. An example of active tracking is fixation, where camera control assures that the gaze direction is maintained on the same object over time. Fixation can be performed with either one camera (monocular fixation) or two cameras (binocular fixation).

In this demonstration we expand from the single focus of attention, and perform the simultaneous tracking of two free-moving targets with a four DOF stereo head. Just as in monocular and binocular fixation, the vision system is able to control the way the targets are imaged.

To the best of our knowledge, this is the first experiment in actively tracking N>1 free-moving targets with a stereo head.

 

Observe Locally, Infer Globally: a Space-Time MRF for Detecting Abnormal Activities with Incremental Updates

Jaechul Kim*, UT-Austin

Kristen Grauman, UT-Austin

This video shows examples of the abnormal activities our algorithm detects in hours of real videos collected at the subway station. We consider abnormal events to be those that are statistical outliers: that is, previously unseen or rarely occurring activities. To measure the abnormality in a statistical sense, we first learn normal patterns of activity at each local region in a video frame by capturing the distribution of its typical optical flow patterns with a Mixture of Probabilistic Principal Component Analyzers (MPPCA). Then, we build a space-time Markov Random Field (MRF) model to detect abnormal activities. The nodes in the MRF graph correspond to a grid of local regions in the video frames, and neighboring nodes in both space and time are associated with links. For any new optical flow patterns detected in incoming video clips, we use the learned MPPCA model and MRF graph to compute a maximum a posteriori estimate of the degree of normality at each local node. Further, we incrementally update the current model parameters as new video observations stream in, so that the model can efficiently adapt to visual context changes over a long period of time. Our space-time MRF model robustly detects abnormal activities both in a local and global sense: not only does it accurately localize the atomic abnormal activities in a crowded video, but at the same time it captures the global-level abnormalities caused by irregular interactions between local activities.
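
A heavily simplified sketch of the local normality model is given below. The paper uses a mixture of probabilistic PCA models per region combined with the space-time MRF; this illustration instead fits a single probabilistic PCA per region with scikit-learn and uses its log-likelihood as a normality score.

```python
# Simplified local-normality sketch (not the full MPPCA + MRF model).
import numpy as np
from sklearn.decomposition import PCA

def fit_region_models(train_flow, n_components=5):
    """train_flow: dict region_id -> array (n_samples, n_features) of
    optical-flow descriptors observed in that region during training."""
    models = {}
    for region, X in train_flow.items():
        pca = PCA(n_components=n_components)
        pca.fit(X)
        models[region] = pca
    return models

def normality_scores(models, new_flow):
    """Return per-region log-likelihoods; low values suggest abnormal activity."""
    return {region: models[region].score_samples(X)
            for region, X in new_flow.items()}
```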

A more detailed description of the algorithm is presented in the paper, "Observe Locally, Infer Globally: a Space-Time MRF for Detecting Abnormal Activities with Incremental Updates", CVPR 2009.

Robust Scene Flow using Binocular Stereo Sequences in Near-Real-Time

Tobi Vaudrey*, University of Auckland

Thomas Brox,

Clemens Rabe,

Andreas Wedel,

Uwe Franke, Daimler

Daniel Cremers,

This video presents a technique for estimating the three-dimensional velocity vector field that describes the motion of each visible scene point (scene flow). The technique uses two consecutive image pairs from a stereo sequence. The main contribution is to decouple the image position and image velocity (optical flow and disparity change) estimation steps, and to estimate dense image velocities using a variational approach. We enforce the scene flow to yield consistent displacement vectors in the left and right images. The decoupling strategy has two main advantages: firstly, we are free to choose any disparity estimation technique, which can yield either sparse or dense correspondences, and secondly, we can achieve frame rates of 10 fps using a GPU implementation.
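
The final reconstruction step can be illustrated with a short sketch (not the authors' variational solver): given per-pixel disparity, optical flow and disparity change for a rectified stereo rig, each pixel's 3D velocity is the difference of its triangulated positions at times t and t+1.

```python
# Scene-flow reconstruction sketch for a rectified stereo rig with focal length f,
# principal point (cx, cy) and baseline b.
import numpy as np

def triangulate(u, v, d, f, cx, cy, baseline):
    Z = f * baseline / np.maximum(d, 1e-6)
    return np.stack([(u - cx) * Z / f, (v - cy) * Z / f, Z], axis=-1)

def scene_flow(d, du, dv, dd, f, cx, cy, baseline):
    """d: disparity at time t; (du, dv): optical flow; dd: disparity change."""
    H, W = d.shape
    v, u = np.mgrid[0:H, 0:W].astype(float)
    P0 = triangulate(u, v, d, f, cx, cy, baseline)
    P1 = triangulate(u + du, v + dv, d + dd, f, cx, cy, baseline)
    return P1 - P0          # per-pixel 3D velocity vectors, shape (H, W, 3)
```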

From Structure-from-Motion Point Clouds to Fast Location Recognition

Arnold Irschara*, TU Graz

Christopher Zach,

Jan-Michael Frahm, University of North Carolina at Chapel Hill

Horst Bischof, TU Graz

Efficient view registration with respect to a given 3D reconstruction has many applications, such as inside-out tracking in indoor and outdoor environments, and geo-locating images from large photo collections. We present a fast location recognition technique based on structure-from-motion point clouds. Vocabulary-tree-based indexing of features directly returns relevant fragments of 3D models instead of documents from the image database. Additionally, we propose a compressed 3D scene representation which improves recognition rates while simultaneously reducing computation time and memory consumption. The design of our method is based on algorithms that efficiently utilize modern graphics processing units to deliver real-time performance for view registration. We demonstrate the approach by matching hand-held outdoor videos to known 3D urban models, and by registering images from online photo collections to the corresponding landmarks.

Human Action Recognition with Interest Points and Camera Motion Compensation

Krystian Mikolajczyk*, University of Surrey

This video shows an approach to human action recognition via local feature tracking and robust estimation of background motion. Multiple interest point detectors are used to provide a large number of features for every frame. The motion vectors for the features are estimated using optical flow and SIFT-based matching. The features are combined with image segmentation to estimate dominant homographies, and then separated into static and moving ones regardless of the camera motion. The action recognition approach can handle camera motion, zoom, human appearance variations, background clutter and occlusion. The motion compensation shows very good accuracy on a number of test sequences. The recognition system has been extensively tested on standard videos as well as real sports actions.
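
The motion-compensation idea can be sketched in a few lines (assumptions: OpenCV is available and feature points have already been matched between consecutive frames; this is not the authors' implementation). Features consistent with the dominant homography are treated as background; the rest are candidates for the moving person.

```python
# Sketch of homography-based camera-motion compensation.
import numpy as np
import cv2

def split_static_moving(pts_prev, pts_curr, reproj_thresh=3.0):
    """pts_prev, pts_curr: float32 arrays of shape (N, 2) with matched points."""
    H, mask = cv2.findHomography(pts_prev, pts_curr, cv2.RANSAC, reproj_thresh)
    inliers = mask.ravel().astype(bool)
    static = pts_curr[inliers]        # consistent with the dominant (camera) motion
    moving = pts_curr[~inliers]       # residual motion: likely the acting person
    return H, static, moving
```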

Stereo Matching with Nonparametric Smoothness Priors in Feature Space

Brandon Smith*, University of Wisconsin-Madison

Li Zhang, University of Wisconsin-Madison

Hailin Jin, Adobe Systems Incorporated

We propose a novel formulation of stereo matching that considers each pixel as a feature vector. Under this view, matching two or more images can be cast as matching point clouds in feature space. We build a nonparametric depth smoothness model in this space that correlates the image features and depth values. This model induces a sparse graph that links pixels with similar features, thereby converting each point cloud into a connected network. This network defines a neighborhood system that captures pixel grouping hierarchies without resorting to image segmentation. We formulate global stereo matching over this neighborhood system and use graph cuts to match pixels between two or more such networks. We show that our stereo formulation is able to recover surfaces with different orders of smoothness, such as those with high-curvature details and sharp discontinuities. Furthermore, compared to other single-frame stereo methods, our method produces more temporally stable results from videos of dynamic scenes, even when applied to each frame independently.
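
The neighborhood system in feature space can be illustrated with a short sketch (not the paper's exact model): each pixel becomes a feature vector (here simply colour plus weighted position), and a sparse k-nearest-neighbour graph in that space links pixels with similar features.

```python
# Sketch of a sparse feature-space neighbourhood graph over image pixels.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def feature_space_graph(image, k=8, spatial_weight=0.5):
    """image: (H, W, 3) array; returns, for each pixel, the indices of and
    distances to its k nearest neighbours in feature space."""
    H, W, _ = image.shape
    y, x = np.mgrid[0:H, 0:W]
    feats = np.concatenate(
        [image.reshape(-1, 3).astype(float),
         spatial_weight * np.stack([x.ravel(), y.ravel()], axis=1)], axis=1)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(feats)
    dist, idx = nn.kneighbors(feats)
    return idx[:, 1:], dist[:, 1:]    # exclude each pixel's trivial self-match
```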

Proposal of Inside-Out Camera for Measuring 3D Gaze Position in Free Space

Kazuaki Nishio, Chubu University

Makoto Kimura, CREST, Japan Science and Technology Agency

Tomoyuki Nagahashi*, Chubu University

Hironobu Fujiyoshi, Chubu University

Yutaka Hirata, Chubu University

We propose a camera system, called an inside-out camera, for measuring 3D gaze position. The system consists of two types of cameras: an "eye camera" and a "scene stereo camera". The eye camera captures images of both eyes through an IR mirror, so the system can compute a line of sight for each eye. Since the scene stereo camera is placed approximately at the location of the eyeball using a half mirror, the system can determine the gaze position in the images of the scene stereo camera. The system then measures the distance to the gaze point in 3D space by triangulation with optimal correction. Our experiments confirm that the proposed system is more effective than a conventional system.

Real-Time O(1) Bilateral Filtering

Qingxiong Yang*, University of Illinois, Urbana

Kar-Han Tan,

Narendra Ahuja, University of Illinois, Urbana

We propose a new bilateral filtering algorithm with computational complexity invariant to filter kernel size, so-called O(1) or constant time in the literature. By showing that a bilateral filter can be decomposed into a number of constant-time spatial filters, our method yields a new class of constant-time bilateral filters that can have arbitrary spatial kernels (an IIR O(1) solution needs to be available for the kernel) and arbitrary range kernels. In contrast, the currently available constant-time algorithm requires the use of specific spatial or specific range kernels. Also, our algorithm lends itself to a parallel implementation, leading to the first real-time O(1) algorithm that we know of. Meanwhile, our algorithm yields higher quality results since we effectively quantize only the range function instead of quantizing both the range function and the input image. Empirical experiments show that our algorithm not only gives higher PSNR, but is about 10x faster than the state-of-the-art. It also has a small memory footprint, needing only 2% of the memory required by the state-of-the-art to obtain the same quality as the exact filter on 8-bit images. We also show that our algorithm can be easily extended to O(1) median filtering. Our bilateral filtering algorithm was tested in a number of applications, including HD video conferencing, video abstraction, highlight removal, and multi-focus imaging.
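
The decomposition idea can be illustrated with a simplified sketch (not the authors' implementation): the intensity range is quantized into a few levels, a constant-time spatial filter (here a box filter) is applied per level, and the result is obtained by linear interpolation between the two nearest levels at each pixel.

```python
# Simplified constant-time bilateral filtering sketch via range quantization.
import numpy as np
from scipy.ndimage import uniform_filter

def fast_bilateral(img, sigma_r=0.1, box_size=15, n_levels=8):
    """img: grayscale image with values in [0, 1]."""
    img = img.astype(float)
    levels = np.linspace(img.min(), img.max(), n_levels)
    J = []
    for q in levels:
        w = np.exp(-0.5 * ((img - q) / sigma_r) ** 2)     # range kernel at level q
        num = uniform_filter(w * img, box_size)           # O(1) spatial filtering
        den = uniform_filter(w, box_size)
        J.append(num / np.maximum(den, 1e-8))
    J = np.stack(J)                                       # per-level filtered images
    # Linearly interpolate between the two nearest quantization levels per pixel.
    t = (img - levels[0]) / (levels[-1] - levels[0] + 1e-12) * (n_levels - 1)
    lo = np.clip(np.floor(t).astype(int), 0, n_levels - 2)
    frac = t - lo
    rows, cols = np.indices(img.shape)
    return (1 - frac) * J[lo, rows, cols] + frac * J[lo + 1, rows, cols]
```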

Vision Based Mobile Mapping

Frank Verbiest*, Katholieke Universiteit Leuven

Luc Van Gool,

Maarten Vergauwen,

Marc Olijslagers,

We present results of a mobile mapping system developed by KU Leuven and its spin-off company GeoAutomation. A van with pre-calibrated cameras mounted on top records images as it drives through the streets. State-of-the-art computer vision techniques yield the relative position of the van with respect to the previous recording instant. The resulting structure and motion are subject to drift and are not geo-referenced. Therefore, we use known ground control points, only about one point every 50 m. These points can be obtained either through static GPS measurement or through topography. Finally, all information is combined in a bundle adjustment to maximize accuracy.

At present, the application we focus on is digital surveying. Operators measure structures of interest directly on our geo-referenced images, from the comfort of their desk, rendering the traditional, costly field campaigns obsolete. Plugins let popular CAD programs, such as MicroStation and AutoCAD, access the data. Surveying campaigns have been successfully concluded in both Europe and North America, with maximum errors of 15 cm or less at a distance of 20 m from the street.

Markerless Motion Capture of Skinned Models

Luca Ballan*, ETH Zurich

The results shown in this video received the best paper award at 3DPVT 2008. The aim of this system is to recover the motion of a character using only four video cameras, i.e., it is a markerless motion capture system. The actor is modeled as a linear blend skinned deformable model acquired using a passive body scanner based on a single video camera. The pose of the actor in each frame is recovered by an optimization procedure that exploits the properties of the LBS deformation model (also known as Skeletal Subspace Deformation, SSD) and uses, as motion cues, both the silhouettes and the optical flows extracted from each recorded video. These choices also allow us to capture the movements of small and highly flexible parts of the body, such as the clavicles and the back, accounting for their non-rigid deformations during pose estimation.
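
For reference, linear blend skinning itself can be written in a few lines (a generic sketch, not this system's code): each vertex is transformed by a weighted blend of the bone transformations, and this is the deformation model the pose optimization exploits.

```python
# Minimal linear blend skinning (LBS / Skeletal Subspace Deformation) sketch.
import numpy as np

def lbs(rest_vertices, bone_transforms, weights):
    """rest_vertices:   (V, 3) vertex positions in the rest pose
       bone_transforms: (B, 4, 4) rigid transform of each bone (rest -> posed)
       weights:         (V, B) skinning weights, rows summing to one."""
    V = rest_vertices.shape[0]
    hom = np.concatenate([rest_vertices, np.ones((V, 1))], axis=1)   # homogeneous coords
    per_bone = np.einsum('bij,vj->bvi', bone_transforms, hom)        # (B, V, 4)
    blended = np.einsum('vb,bvi->vi', weights, per_bone)             # weighted blend
    return blended[:, :3]
```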

Temporal Dithering of Illumination for Fast Shape Acquisition

Shuntaro Yamazaki*, AIST

Sanjeev Koppal, CMU

Srinivasa Narasimhan, CMU

Active vision techniques use programmable light sources, such as projectors, whose intensities can be controlled over space and time. We present a broad framework for fast active vision using Digital Light Processing (DLP) projectors. The digital micromirror device (DMD) in a DLP projector is capable of switching mirrors ``on'' and ``off'' at high speeds (on the order of microseconds). An off-the-shelf DLP projector, however, effectively operates at much lower rates (30-60 Hz) by emitting smaller intensities that are integrated over time by a sensor (eye or camera) to produce the desired brightness value. Our key idea is to exploit this ``temporal dithering'' of illumination, as observed by a high-speed camera. The dithering encodes each brightness value uniquely and may be used in conjunction with virtually any active vision technique. In this research, we apply our approach to structured light-based range finding.

Coplanar Shadowgrams for Acquiring Visual Hulls of Intricate Objects

Shuntaro Yamazaki*, AIST

Srinivasa Narasimhan, CMU

Simon Baker, MSR

Takeo Kanade, CMU

Acquiring 3D models of intricate objects (like tree branches, bicycles and insects) is a hard problem due to severe self-occlusions, repeated thin structures and surface discontinuities. In theory, a shape-from-silhouettes (SFS) approach can overcome these difficulties and use many views to reconstruct visual hulls that are close to the actual shapes. In practice, however, SFS is highly sensitive to errors in the silhouette contours and in the calibration of the imaging system, and is therefore not suitable for obtaining reliable shapes with a large number of views. We present a practical approach to SFS using a novel technique called coplanar shadowgram imaging that allows us to use dozens to even hundreds of views for visual hull reconstruction. Here, a point light source is moved around an object and the shadows (silhouettes) cast onto a single background plane are observed. We characterize this imaging system in terms of image projection, reconstruction ambiguity, epipolar geometry, and shape and source recovery. The coplanarity of the shadowgrams yields novel geometric properties that are not possible in traditional multi-view camera-based imaging systems. These properties allow us to derive a robust and automatic algorithm to recover the visual hull of an object and the 3D positions of the light source simultaneously, regardless of the complexity of the object. We demonstrate the acquisition of several intricate shapes with severe occlusions and thin structures, using 50 to 120 views.
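
The underlying visual hull computation can be sketched as voxel carving from silhouettes (a generic illustration, not the paper's algorithm, in which the "views" are shadowgrams cast by a moving point light source rather than camera images): a voxel is kept only if it projects inside the silhouette in every view.

```python
# Generic voxel-carving sketch of shape-from-silhouettes.
import numpy as np

def visual_hull(voxels, projections, silhouettes):
    """voxels:      (N, 3) candidate 3D points
       projections: list of 3x4 projection matrices, one per view
       silhouettes: list of binary (H, W) masks, same order as projections."""
    hom = np.concatenate([voxels, np.ones((len(voxels), 1))], axis=1)
    inside = np.ones(len(voxels), dtype=bool)
    for P, sil in zip(projections, silhouettes):
        uvw = hom @ P.T
        u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
        v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
        valid = (u >= 0) & (u < sil.shape[1]) & (v >= 0) & (v < sil.shape[0])
        uc = np.clip(u, 0, sil.shape[1] - 1)
        vc = np.clip(v, 0, sil.shape[0] - 1)
        inside &= valid & (sil[vc, uc] > 0)     # keep voxels inside every silhouette
    return voxels[inside]
```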

Learning General Optical Flow Subspaces for Egomotion Estimation and Detection of Motion Anomalies

Richard Roberts*, Georgia Inst. of Technology

Christian Potthast, Georgia Inst. of Technology

Frank Dellaert, Georgia Inst. of Technology

This work deals with estimation of dense optical flow and ego-motion in a generalized imaging system by exploiting probabilistic linear subspace constraints on the flow. We deal with the extended motion of the imaging system through an environment that we assume to have some degree of statistical regularity. For example, in autonomous ground vehicles the structure of the environment around the vehicle is far from arbitrary, and the depth at each pixel is often approximately constant. The subspace constraints hold not only for perspective cameras, but in fact for a very general class of imaging systems, including catadioptric and multiple-view systems. Using minimal assumptions about the imaging system, we learn a probabilistic subspace constraint that captures the statistical regularity of the scene geometry relative to an imaging system. We propose an extension to probabilistic PCA (Tipping and Bishop, 1999) as a way to robustly learn this subspace from recorded imagery, and demonstrate its use in conjunction with a sparse optical flow algorithm. To deal with the sparseness of the input flow, we use a generative model to estimate the subspace using only the observed flow measurements. Additionally, to identify and cope with image regions that violate subspace constraints, such as moving objects, objects that violate the depth regularity, or gross flow estimation errors, we employ a per-pixel Gaussian mixture outlier process. We demonstrate results of finding the optical flow subspaces and employing them to estimate dense flow and to recover camera motion for a variety of imaging systems in several different environments.

Sweethearting Detection in a Retail Checkout Environment

Quanfu Fan*, IBM

Employee-related theft is a pervasive problem in the retail environment.  According to many sources, this type of theft accounts for most of retail shrink. A particular variation of this fraud, known as “fake scanning,” is considered by many retailers to be one of the largest sources of retail shrink.  In this video, we discuss the impact of checkout-related theft on the retail industry and demonstrate how fake scanning is performed.  We then detail how our real-time vision-based system detects this type of fraud, operating on data collected from a real grocery store.  In our system, we segment motion in three salient regions of the checkout station that correspond to the scanning behavior of the cashier.  Initial motion segments are further verified through spatiotemporal features with a bag of features model.  We then exploit the strong temporal ordering of scanning behavior to find all possible primitive triplets, where each primitive corresponds to segmented motion in a distinct region. An optimal path of visual scans is then obtained through a constrained Viterbi algorithm, which is based on the cashier’s tendency to process items as quickly as possible.  We further talk about the complexities of cashier behavior that create challenging problems for an activity recognition system and explain how our approach was formulated to deal with these issues.  Finally, we conclude with brief remarks about future directions.
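
The dynamic-programming step can be illustrated with a simplified sketch (not the paper's constrained Viterbi over primitive triplets): among time-ordered candidate scan events, it selects the highest-scoring ordered subset, with a transition cost that penalizes long gaps, reflecting the assumption that a cashier processes items as quickly as possible.

```python
# Simplified dynamic-programming sketch for selecting an ordered path of scans.
import numpy as np

def best_scan_path(candidates, gap_penalty=0.1):
    """candidates: list of (time, score) tuples sorted by time."""
    n = len(candidates)
    best = np.full(n, -np.inf)
    prev = np.full(n, -1, dtype=int)
    for j in range(n):
        t_j, s_j = candidates[j]
        best[j] = s_j                               # path that starts at candidate j
        for i in range(j):
            t_i, _ = candidates[i]
            cand = best[i] + s_j - gap_penalty * (t_j - t_i)
            if cand > best[j]:
                best[j], prev[j] = cand, i
    path, j = [], int(np.argmax(best))              # backtrack from the best end point
    while j != -1:
        path.append(j)
        j = prev[j]
    return path[::-1], best.max()
```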

“Smart Room” with Real-time Multi-camera People Tracking

Kyungnam Kim*, HRL Laboratories

Swarup Medasani,

Yuri Owechko,

A “Smart Room” demo video, part of the Contextual Visual Dataspace research project, is presented in the CVPR 2009 Video Review. The system, with multiple cameras installed in a room, tracks people in real time, shows each person’s trajectory and walking direction, and reports the number of people in the room. The system uses two quad-core CPUs to perform real-time processing. At the end of the video, a smart room game demo (Breakout) is presented.

A Robust Approach for Automatic Registration of Aerial Images with Untextured Aerial LiDAR Data

Lu Wang*, University of Southern California

Ulrich Neumann, USC

Airborne LiDAR technology has drawn increasing interest for large-scale 3D urban modeling in recent years. 3D LiDAR data typically has no texture information. To generate photo-realistic 3D models, oblique aerial images are needed for texture mapping, in which the key step is to obtain accurate registration between the aerial images and the untextured 3D LiDAR data. We present a robust automatic registration approach. A novel feature called 3CS, composed of connected line segments, is proposed. Putative line segment correspondences are obtained by matching 3CS features detected in both the aerial images and the 3D LiDAR data. Outliers are removed with a two-level RANSAC algorithm that integrates local and global processing to improve robustness and efficiency. The approach has been tested on 2290 aerial images that cover a variety of urban environments in the Oakland and Atlanta areas. Its correct pose recovery rate is over 98%.