
HON4D (O. Oreifej et al., CVPR2013)

Published on: Mar 3, 2016
Source: www.slideshare.net


Transcripts - HON4D (O. Oreifej et al., CVPR2013)

  • 1. HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences. Omar Oreifej† and Zicheng Liu‡ (†University of Central Florida, ‡Microsoft Research). CVPR 2013 paper introduction. Presenter: Mitsuru NAKAZAWA@Osaka Univ., JPN
  • 2. • These slides are unofficial: I am the presenter, NOT an author of HON4D – Presenter: Mitsuru Nakazawa@Osaka Univ., JPN – nakazawa[at]am.sanken.osaka-u.ac.jp • HON4D – http://www.cs.ucf.edu/~oreifej/HON4D.html • Required knowledge: HOG
  • 3. Histogram of 4D Surface Normals (overview slide reproduced from the University of Bonn, Institute of Computer Science III, Computer Vision Group) • Surface normals • Quantization according to “projectors” p_i • Add additional discriminative “projectors” = bins of the histogram [O. Oreifej and Z. Liu. HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences. CVPR 2013. Available at http://www.cs.ucf.edu/~oreifej/HON4D.html] [URL] Accessed Sept. 25, 2013
  • 4. Introduction • Compared with conventional color images, depth maps provide several advantages – Depth maps reflect pure geometry and shape cues • It seems natural to employ depth data in many computer vision problems, like action recognition • Question: would conventional RGB-based methods also perform well on depth sequences?
  • 5. Related work 1 • Local interest point-based methods originally developed for color sequences: STIP (Laptev et al. 2005) [10] and Dollar (Dollar et al. 2005) [5] [Figure: thumbnails of the first pages of the two papers] → It is difficult to apply these methods to depth data because – Detectors such as STIP and Dollar are not reliable in depth sequences – Standard methods for automatically acquiring motion trajectories in color images are also not reliable in depth sequences
  • 6. Related work 2 • Holistic approaches – Instead of using local points, a global feature is obtained for the entire sequence, e.g. Yang et al. 2012 [26] and Vieira et al. 2012 [21] [Figure: descriptor illustrations from Yang et al. [26], Vieira et al. [21], and HON4D] • We demonstrate that our method captures the complex and articulated structure and motion within the sequence using a richer and more discriminative descriptor than [26] and [21]
  • 7. Contributions 1. We propose a novel descriptor for activity recognition from depth sequences, in which we encode the distribution of the surface normal orientation in the 4D space of depth, time, and spatial coordinates. 2. We demonstrate how to quantize the 4D space using the vertices of a polychoron, and then refine the quantization to become more discriminative.
  • 8. 4D Surface Normal • The depth sequence $z = f(x, y, t)$ constitutes a surface in 4D space, represented as the set of points $(x, y, t, z)$ satisfying $S(x, y, t, z) = f(x, y, t) - z = 0$ • The normal to the surface $S$ is computed as $n = \nabla S = (\frac{\partial z}{\partial x}, \frac{\partial z}{\partial y}, \frac{\partial z}{\partial t}, -1)^T$ (1) • Only the orientation of the normal is relevant to the shape of the 4D surface, so we normalize it to the unit-length normal $\hat{n} = (f_x, f_y, f_t, -1)^T / \|(f_x, f_y, f_t, 1)^T\|_2$ • The fourth dimension of $\hat{n}$, $-1 / \|(f_x, f_y, f_t, 1)^T\|_2$, encodes the magnitude of the gradient, so HON4D can use different bins based on the gradient magnitude: normals with different gradient magnitudes fall into different bins, whereas a histogram of gradient orientation (e.g. HOG3D) uses the magnitude only as a weight for the bins. This inherent difference lets HON4D capture richer information.
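To make the computation above concrete, here is a minimal NumPy sketch of my own; it is not the authors' code (their implementation is linked on slide 2), and the function and variable names are assumptions.

```python
# Minimal sketch (not the authors' code) of the 4D surface normal
# computation: treat the depth sequence as z = f(x, y, t), take the
# spatiotemporal gradient, append the constant -1, and normalize.
import numpy as np

def surface_normals_4d(depth_seq):
    """depth_seq: (T, H, W) array holding z = f(x, y, t).
    Returns (T, H, W, 4) unit normals (f_x, f_y, f_t, -1)^T / ||.||_2."""
    f = depth_seq.astype(np.float64)
    f_t, f_y, f_x = np.gradient(f)        # derivatives along t, y, x
    n = np.stack([f_x, f_y, f_t, -np.ones_like(f)], axis=-1)
    # the norm is always >= 1 because of the constant -1 component
    return n / np.linalg.norm(n, axis=-1, keepdims=True)
```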
  • 9. HOG vs. HON [Figure: two surfaces, one with a higher inclination; the gradient orientation is similar for both surfaces, so HOG produces a similar distribution for the two, while HON produces a discriminable distribution]
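A small worked example of my own (not from the slides) makes the figure's point concrete. Take two surface patches with spatiotemporal gradients $(f_x, f_y, f_t) = (1, 0, 0)$ and $(3, 0, 0)$. Both gradients point along $+x$, so they fall into the same gradient-orientation bin; the magnitudes 1 and 3 only change the bin weights. The unit 4D normals, however, are $(1, 0, 0, -1)^T/\sqrt{2} \approx (0.71, 0, 0, -0.71)$ and $(3, 0, 0, -1)^T/\sqrt{10} \approx (0.95, 0, 0, -0.32)$: their 4D orientations differ, so HON can separate the two surfaces while HOG cannot.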
  • 10. Histogram of 4D Normals for each cell • Projectors obtained from the 600-cell (one of the polychorons) – The 600-cell divides the 4D space uniformly with its 120 vertices – The 4D space is quantized using the 120 vertices → projectors [600-cell image: Stella, Polyhedron Navigator, http://www.software3d.com/Stella.php] • For each cell, take the component of each normal in each projector direction, then normalize using the sum across all projectors (a sketch follows below) [Figure 2 of the paper: the various steps for computing the HON4D descriptor]
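One plausible reading of this quantization step, sketched in NumPy. This is my illustration, not the reference implementation; the `projectors` matrix of 120 unit 4D vectors (the 600-cell vertices) is assumed to be given.

```python
# Hedged sketch of the per-cell quantization: the component of each unit
# normal in each projector direction, normalized by the sum across all
# projectors, is accumulated into the cell's histogram. Keeping only
# non-negative components is my assumption, not stated on the slide.
import numpy as np

def hon4d_cell_histogram(normals, projectors):
    """normals: (N, 4) unit 4D normals inside one spatiotemporal cell.
    projectors: (P, 4) unit vectors (P = 120 vertices of the 600-cell).
    Returns a length-P histogram for the cell."""
    comp = normals @ projectors.T                    # (N, P) components
    comp = np.maximum(comp, 0.0)                     # assumption: clip < 0
    comp /= comp.sum(axis=1, keepdims=True) + 1e-12  # normalize per normal
    return comp.sum(axis=0)                          # accumulate histogram
```

The full HON4D descriptor would then concatenate such histograms over all spatiotemporal cells of the sequence.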
  • 11. Projector refinement • Is uniform space quantization optimal? Not when two different classes of activities are quite close in the feature space, such that their samples mostly fall in similar bins • We set the weighting coefficients for the projectors using an SVM trained on the HON4D descriptors [Figure: SVM training on HON4D descriptors, showing a support vector x, the weight a corresponding to the support vector, and the learned weight vector w]
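A simplified sketch of the weighting idea, using scikit-learn's linear SVM as my stand-in; the paper's full refinement additionally induces new projectors around the discriminative ones, which is not shown here.

```python
# Train a linear SVM on HON4D descriptors and score each projector (bin)
# by the support-vector mass weighted by the |alpha| coefficients. This
# only ranks bins by discriminativeness; it is not the complete refinement.
import numpy as np
from sklearn.svm import SVC

def discriminative_density(X, y):
    """X: (M, P) HON4D descriptors, y: (M,) activity labels.
    Returns a length-P discriminativeness score per projector."""
    clf = SVC(kernel="linear").fit(X, y)
    alphas = np.abs(clf.dual_coef_)       # |alpha| per support vector
    return (alphas @ clf.support_vectors_).sum(axis=0)
```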
  • 12. Experiments using existing databases • MSR Action 3D [12], MSR Gesture 3D [23] [Figures: sample frames from MSR Action 3D [12] and MSR Gesture 3D [23]] • HON4D does not use a skeleton tracker, and yet we outperform the skeleton-based method [24]
  • 13. New database: 3D Action Pairs Dataset • Although the two actions of each pair are similar in motion and shape, the motion-shape relation is different [Figure: sample action pairs and an accuracy comparison of Skeleton + LOP, Skeleton + LOP + Pyramid, HON4D, and HON4D + Discriminative density]
  • 14. Local HON4D • For the case when actors significantly change their spatial locations and the temporal extent of the activities varies significantly – Local HON4D: Histogram of 4D Normals of spatiotemporal patches centered at skeleton joints (see the sketch below) • Experiment using MSR Daily Activity 3D [24] – Local HON4D: 80.00% – Local occupancy pattern feature (LOP, Wang et al. 2012): 67.50% → HON4D is also superior for significantly non-aligned sequences
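A sketch of how such a local variant could be assembled, reusing `hon4d_cell_histogram` from the slide-10 sketch above; the patch half-sizes and helper names are my assumptions.

```python
# Compute one HON4D histogram per spatiotemporal patch centered at a
# tracked skeleton joint and concatenate the per-joint histograms.
import numpy as np

def local_hon4d(normals, joints, projectors, half=(8, 24, 24)):
    """normals: (T, H, W, 4) unit 4D normals of the whole sequence.
    joints: iterable of integer (t, y, x) joint locations.
    Returns the concatenated per-joint histograms."""
    ht, hy, hx = half
    feats = []
    for t, y, x in joints:
        patch = normals[max(t - ht, 0):t + ht,
                        max(y - hy, 0):y + hy,
                        max(x - hx, 0):x + hx]
        feats.append(hon4d_cell_histogram(patch.reshape(-1, 4), projectors))
    return np.concatenate(feats)
```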
  • 15. Conclusion • We presented a novel, simple, and easily implementable descriptor for activity recognition from depth sequences. – We initially quantize the 4D space using the vertices of a 600-cell polychoron, and use that to compute the distribution of the 4D normal orientation for each depth sequence. – We estimate the discriminative density at each vertex of the polychoron, and induce further vertices accordingly, thus placing more emphasis on the discriminative bins of the histogram. • We showed by experiments that the proposed method outperforms all previous approaches on all relevant benchmark datasets.
