Back to my homepage

Using a Particle Filter for Gesture Recognition

Alexander Gruenstein

Introduction

For my final project, I experimented with applying a particle filter to the problem of gesture recognition. Specifically, I attempted to differentiate between the following two American Sign Language (ASL) signs (special thanks goes to Anna for allowing me to film her!) (please click on the signs to see a sample video): My procedure was as follows:
  1. Film 7 examples of each sign
  2. Find skin-colored pixels in every frame
  3. Find the three largest 'blobs' of skin-colored pixels in each frame (the head, right hand, and left hand)
  4. Calculate the motion trajectories of each blob over time
  5. Create a model of the average trajectories of the blobs for each of the two signs, using 5 of the examples
  6. Use a particle filter to classify both the test and training sets of video sequences as one of the two signs

Related Work

There has been a lot of work in recognizing sign language. My work builds mainly on two papers:

Black and Jepson have previously applied particle filters (CONDENSATION) to the problem of gesture recognition. Their work differs from mine in that they recognized only 'whiteboard' style gestures like "save," "cut," and "paste" made with a distinctly colored object (a "phicon"). In my project, I recognize actual ASL signs made with two hands.

Yang and Ahuja also use motion trjaectories to recognize ASL signs. However, they use a Time-Delayed Nueral Network (TDNN) to classify signs, while I use a particle filter

My work, then, builds on Yang and Ahuja's in the sense that I use the motion trajectories of the hands to recognize ASL signs. Unlike Yang and Ahuja, I don't attempt to robustly solve of the problem of tracking the hands in the first place. I extend the work of Black and Jepson by applying CONDENSATION to multiple motion trajectories simultaneously.

Filming The Examples

The image sequences were filmed using a Sony DCR-TRV900 MiniDV camera. They were manually aligned and then converted into sequences of TIFs to be processed in MATLAB. Each TIF was 243x360 pixels, 24bit color. The lighting and background in each sequence is held constant; the background is not cluttered. The focus of my project was not to solve the tracking problem, hence I wanted the hands to be relatively easy to track. I collected 7 film sequences of each sign.

Finding Skin Colored Pixels

In order to segment out skin-colored pixels, I used the color_segment routine we developed in MATLAB for our last homework assignment. Every image in every each sequence was divided into the following regions: skin, background, clothes, and outliers. The source code is here:

The original image:

The skin pixel mask:

Finding Skin-Colored Blobs

I then calculated the centroids of the three largest skin colored 'blobs' in each image. Blobs were calculated by processing the skin pixel mask generated in the previous step. A blob is defined to be a connected region of 1's in the mask. Finding blobs turned out to be a bit more difficult than I had originally thought. My first implementation was a straightforward recursive algorithm which scans the top down from left to right until it comes across a skin pixel which has yet to be assigned to a blob. It then recursively checks each of that pixel's neighbors to see if they too are skin pixels. If they are, it assigns them to the same blob and recurses. On such large images, this quickly led to stack overflow and huge inefficiency in MATLAB.

The working algorithm I eventually came up with is an iterative one that scans the skin pixel mask from left to right top down. When it comes across a skin pixel that has yet to be assigned to a blob, it first checks pixels neighbors (to the left and above) to see if they are in a blob. If they aren't, it creates a new blob and adds the newly found pixel to the blob. If any of the neighbors are in a blob, it assigns the pixel to the neighbor's blob. However, two non-adjacent nieghbors might be in different blobs, so these blobs must be merged into a single blob.

Finally, the algorithm searches for the 3 largest blobs and calculates each of their respective centroids.

The MATLAB code can be found here: blob2.m

These videos show the 3 largest skin-colored blobs tracked for the two example video sequences above (ignore the funny colors -- they are just an artifact of the blob search)

Calculating the Blobs' Motion Trajectories over Time

At this point, tracking the trajectories of the blobs over time was fairly simple. For a given video sequence, I made a list of the position of the centroid for each of the 3 largest blobs in each frame (source code: centroids.m). Then, I examined the first frame in the sequence and determined which centroid was farthest to the left and which was farthest to the right. The one on the left corresponds to the right hand of signer, the one to the right corresponds to the left hand of the signer. Then, for each successive frame, I simply determined which centroid was closest to each of the previous left centroid and called this the new left centroid; I did the same for the blob on the right. Once the two blobs were labeled, I calculated the horizontal and vertical velocity of both blobs across the two frames using [(change in position)/time]. I recorded these values for each sequential frame pair in the sequence. The source code is here: split_centroids.m.

Creating the Motion Models

I then created models of the hand motions involved in each sign. Specifically, for each frame in the sign, I used 5 training instances to calculate the average horizontal and vertical velocities of both hands in that particular frame. The following graphs show the models derived for both signs: (These turned out a bit grainy as jpegs, there are pdf versions here: model1.pdf and model2.pdf).

Model 1 "leftover":

Model 2 "paddle":


Sample source code for creating a model is here: make_model2.m

Using CONDENSATION to Classify New Video Sequences

All the image preprocessing is now finished and the 2 motion models have been created. Now follows a brief description of the Condensation algorithm and then a description of how I applied it to this specific task.

The Basics of Condensation

The Condensation algorithm (Conditional Density Propagation over time) makes use of random sampling in order to model arbitrarily complex probability density functions. That is, rather than attempting to fit a specific equation to observed data, it uses N weighted samples to approximate the curve described by the data. Each sample consists of a state and a weight proportional to the probability that the state is predicted by the input data. As the number of samples increases, the precision with which the samples model the observed pdf increases.

Now assume that a series of observations are made during time steps tex2html_wrap_inline130 . In order to generate the new sample set at time t+1, states are randomly selected (with replacement) from the sample set at t, based on their weight; that is, the weight of each sample determines the probability it will be chosen. Given such a randomly sampled state tex2html_wrap_inline136 , a prediction of a new state tex2html_wrap_inline138 at time step t+1 is made based on a predictive model. This corresponds to sampling from the process density tex2html_wrap_inline142 , where tex2html_wrap_inline144 is a vector of parameters describing the object's state. Finally, tex2html_wrap_inline138 is assigned a weight proportional to the probability tex2html_wrap_inline148 , where tex2html_wrap_inline150 is a set of parameters describing the observed state of the object at time t+1. Then the process iterates for the next observation. In this way, predicted states that correspond better to the data receive larger weights. Since arbitrarily complex pdfs can be modeled, an arbitrary number of competing hypotheses (assuming sufficiently large N) can be maintained until a single hypothesis dominates.

Applying Condensation to Recognizing ASL

In order to apply the Condensation Algorithm to sign-language recognition, I extend the methods described by Black and Jepson. Specifically, a state at time t is described as a parameter vector: tex2html_wrap_inline158 where:

tex2html_wrap_inline160 is the integer index of the predictive model;
tex2html_wrap_inline162 indicates the current position in the model;
tex2html_wrap_inline164 refers to an amplitudal scaling factor;
tex2html_wrap_inline166 is a scale factor in the time dimension.
where tex2html_wrap_inline168

Note that i indicates which hand's motion trajectory this tex2html_wrap_inline172 , tex2html_wrap_inline174 , or tex2html_wrap_inline176 refers to. My models contain data about the motion trajectory of both the left hand and the right hand; by allowing two sets of parameters, I allow the motion trajectory of the left hand to be scaled and shifted separetely from the motion trajectory of the right hand (so, for example, tex2html_wrap_inline178 refers to the current position in the model for the left hand's trajectory, while tex2html_wrap_inline180 refers to the position in the model for the right hand's trajectory).

In summary, there are 7 parameters that describe each state.

Initialization

The sample set is initialized with N samples distributed over possible starting states and each assigned a weight of tex2html_wrap_inline184 . Specifically, the initial parameters are picked uniformly according to:

tex2html_wrap_inline186
tex2html_wrap_inline188 , where tex2html_wrap_inline190 [0,1]
tex2html_wrap_inline192
tex2html_wrap_inline194

In this application, I set the parameters as follows:
tex2html_wrap_inline196 = 2, since there are two models
tex2html_wrap_inline198 for 50 percent scaling
tex2html_wrap_inline200 for 50 percent scaling

Prediction

In the prediction step, each parameter of a randomly sampled tex2html_wrap_inline136 is used to determine tex2html_wrap_inline138 based on the parameters of that particular tex2html_wrap_inline136 . Each old state, tex2html_wrap_inline136 , is randomly chosen from the sample set, based on the weight of each sample. That is, the weight of each sample determines the probability of its being chosen. This is done efficiently by creating a cumulative probability table, choosing a uniform random number on [0,1], and then using binary search to pull out a sample (see Isard and Blake for details).

The following equations are used to choose the new state

tex2html_wrap_inline210
tex2html_wrap_inline212
tex2html_wrap_inline214
tex2html_wrap_inline216

where tex2html_wrap_inline218 refers to a number chosen randomly according to the normal distribution with standard deviation tex2html_wrap_inline220 . This adds an element of uncertainty to each prediction, which keeps the sample set diffuse enough to deal with noisy data. In this application I set: tex2html_wrap_inline222 .

For a given drawn sample, predictions are generated until all of the parameters are within the accepted range. If, after, a set number of attempts it is still impossible to generate a valid prediction, a new sample is created according to the initialization procedure above. In addition, 10 percent of all samples in the new sample set are initialized randomly as in the initialization step above (with the exception that rather than having the phase parameter biased towards zero, it is biased towards the number of observations that have been made thus far). This ensures that local maxima can't completely take over the curve; new hypothesese are always given a chance to dominate.

Updating

After the Prediction step above, there exists a new set of N predicted samples which need to be assigned weights. The weight of each sample is a measure of its likelihood given the observed data tex2html_wrap_inline226 . I define tex2html_wrap_inline228 as a sequence of observations for the ith coefficient over time; specifically, let tex2html_wrap_inline232 be the sequence of observations of the horizontal velocity of the left hand, the vertical velocity of the left hand, the horizontal velocity of the right hand, and the vertical velocity of the right hand respectively.

Extending Black and Jepson, I then calculate the weight by the following equation:

equation84


where

displaymath234

and where w is the size of a temporal window that spans back in time (here, I take w = 10). Note that tex2html_wrap_inline174 , tex2html_wrap_inline172 , and tex2html_wrap_inline176 refer to the appropriate parameters of the model for the blob in question and that tex2html_wrap_inline246 refers to the value given to the ith coefficient of the model tex2html_wrap_inline160 interpolated at time tex2html_wrap_inline252 and scaled by tex2html_wrap_inline174 .

Classification

With this algorithm in place, all that remains is actually classifying the video sequence as one of the two signs. Since the whole idea of Condensation is that the most likely hypothesis will dominate by the end, I chose to use the criterion of which model was deemed most likely at the end of the video sequence to determine the class of the entire video sequence. Determining the probability assigned to each model is a simple matter of summing the weights of each sample in the sample set at a given moment whose state refers to the model in question. The following graphs plot the likelihood of each model over time for an instance of each sign (the first is a sign that is classified as model 1, the second a sign that is classified as model 2):
tex2html_wrap256 tex2html_wrap258
Using this criterion, my system correctly classified 80 percent of the signs it was trained on and 75 percent of novel signs.

Condensation Implementation

The Condensation algorithm was coded in C++ and ran much in much faster than real time on the sweet hall machines (excluding the image preprocessing). The complete source code is here:

SOURCE ARCHIVE

Note from March, 2005. This project is from 2002, for a computer vision class at stanford. I have received quite a few requests for missing pieces of the source code. As this page was meant to be a writeup, rather than to provide an actual working program, some of the source files are not included in the description. I have archived my original project directory, in all of its messy glory, and that is available here. I make no claims regarding the usability or readability of this source code, but provide it for those who have requested it in order to actually run the code. The archive is here gruenstein.tar.gz or gruenstein.zip
If you use this code for any interesting or publishable work, please send me an email and let me know.

References

For more background on gesture recognition, see my literature review here: ps or pdf

1
Michael J. Black and Allan D. Jepson. A probabilistic framework for matching temporal trajectories: Condensation-based recognition of gestures and expressions. In Proceedings 5th European Conference Computer Vision, volume 1, pages 909-924, 1998.

2
Michael Isard and Andrew Blake. Contour tracking by stochastic propagation of conditional density. In Proceedings European Conference on Computer Vision, pages 343-356, 1996.

3
Michael Isard and Andrew Blake. A mixed-state condensation tracker with automatic model-switching. In Proceedings 6th Internal Conference Computer Vision, pages 107-112, 1998.

4
Ming-Hsuan Yang and Narendra Ahuja. Recognizing hand gesture using motion trajectories. In IEEE CS Conference on Computer Vision and Pattern Recognition, volume 1, pages 466-472, June 1999.


Alexander Houston Gruenstein
Last modified: Tue May 28 01:32:21 PDT 2002