There has been a lot of work in recognizing sign language. My work builds mainly on two papers:
Black and Jepson have previously applied particle filters (CONDENSATION) to the problem of gesture recognition. Their work differs from mine in that they recognized only 'whiteboard' style gestures like "save," "cut," and "paste" made with a distinctly colored object (a "phicon"). In my project, I recognize actual ASL signs made with two hands.
Yang and Ahuja also use motion trjaectories to recognize ASL signs. However, they use a Time-Delayed Nueral Network (TDNN) to classify signs, while I use a particle filter
My work, then, builds on Yang and Ahuja's in the sense that I use the motion trajectories of the hands to recognize ASL signs. Unlike Yang and Ahuja, I don't attempt to robustly solve of the problem of tracking the hands in the first place. I extend the work of Black and Jepson by applying CONDENSATION to multiple motion trajectories simultaneously.
In order to segment out skin-colored pixels, I used the color_segment routine we developed in MATLAB for our last homework assignment. Every image in every each sequence was divided into the following regions: skin, background, clothes, and outliers. The source code is here:
The original image:
I then calculated the centroids of the three largest skin colored 'blobs' in each image. Blobs were calculated by processing the skin pixel mask generated in the previous step. A blob is defined to be a connected region of 1's in the mask. Finding blobs turned out to be a bit more difficult than I had originally thought. My first implementation was a straightforward recursive algorithm which scans the top down from left to right until it comes across a skin pixel which has yet to be assigned to a blob. It then recursively checks each of that pixel's neighbors to see if they too are skin pixels. If they are, it assigns them to the same blob and recurses. On such large images, this quickly led to stack overflow and huge inefficiency in MATLAB.
The working algorithm I eventually came up with is an iterative one that scans the skin pixel mask from left to right top down. When it comes across a skin pixel that has yet to be assigned to a blob, it first checks pixels neighbors (to the left and above) to see if they are in a blob. If they aren't, it creates a new blob and adds the newly found pixel to the blob. If any of the neighbors are in a blob, it assigns the pixel to the neighbor's blob. However, two non-adjacent nieghbors might be in different blobs, so these blobs must be merged into a single blob.
Finally, the algorithm searches for the 3 largest blobs and calculates each of their respective centroids.
The MATLAB code can be found here: blob2.m
These videos show the 3 largest skin-colored blobs tracked for the two example video sequences above (ignore the funny colors -- they are just an artifact of the blob search)
The Condensation algorithm (Conditional Density Propagation over time) makes use of random sampling in order to model arbitrarily complex probability density functions. That is, rather than attempting to fit a specific equation to observed data, it uses N weighted samples to approximate the curve described by the data. Each sample consists of a state and a weight proportional to the probability that the state is predicted by the input data. As the number of samples increases, the precision with which the samples model the observed pdf increases.
Now assume that a series of observations are made during time steps
. In order to generate the new sample set at time t+1,
states are randomly selected (with replacement) from the sample set at
t, based on their weight; that is, the weight of each sample determines
the probability it will be chosen. Given such a randomly sampled state
, a prediction of a new state
at time
step t+1 is made based on a predictive model. This corresponds to
sampling from the process density
, where
is a vector of parameters describing
the object's state. Finally,
is assigned a weight
proportional to the probability
,
where
is a set of parameters describing the observed
state of the object at time t+1. Then the process iterates for the next
observation. In this way, predicted states that correspond better to the
data receive larger weights. Since arbitrarily complex pdfs can be
modeled, an arbitrary number of competing hypotheses (assuming
sufficiently large N) can be maintained until a single hypothesis
dominates.
In order to apply the Condensation Algorithm to sign-language recognition,
I extend the methods described by Black and Jepson. Specifically, a
state at time t is described as a parameter vector:
where:
is the integer index of the predictive model;
indicates the current position in the model;
refers to an amplitudal scaling factor;
is a scale factor in the time dimension.
where
Note that i indicates which hand's motion trajectory this
,
, or
refers to. My models contain data about the
motion trajectory of both the left hand and the right hand; by allowing
two sets of parameters, I allow the motion trajectory of
the left hand to be scaled and shifted separetely from the motion
trajectory of the right hand (so, for example,
refers to the
current position in the model for the left hand's trajectory, while
refers to the position in the model for the right hand's
trajectory).
In summary, there are 7 parameters that describe each state.
The sample set is initialized with N samples distributed over possible
starting states and each assigned a weight of
. Specifically,
the initial parameters are picked uniformly according to:
, where
[0,1]
In this application, I set the parameters as follows:
= 2, since there are two models
for 50 percent scaling
for 50 percent scaling
In the prediction step, each parameter of a randomly sampled
is used to determine
based on the
parameters of that particular
. Each old state,
, is randomly chosen from the sample set, based on the
weight of each sample. That is, the weight of each sample determines the
probability of its being chosen. This is done efficiently by creating a
cumulative probability table, choosing a uniform random number on [0,1],
and then using binary search to pull out a sample (see Isard and Blake for
details).
The following equations are
used to choose the new state
where
refers to a number chosen randomly according
to the normal distribution with standard deviation
. This adds
an element of uncertainty to each prediction, which keeps the sample set
diffuse enough to deal with noisy data. In this application I set:
.
For a given drawn sample, predictions are generated until all of the parameters are within the accepted range. If, after, a set number of attempts it is still impossible to generate a valid prediction, a new sample is created according to the initialization procedure above. In addition, 10 percent of all samples in the new sample set are initialized randomly as in the initialization step above (with the exception that rather than having the phase parameter biased towards zero, it is biased towards the number of observations that have been made thus far). This ensures that local maxima can't completely take over the curve; new hypothesese are always given a chance to dominate.
After the Prediction step above, there exists a new set of N predicted
samples which need to be assigned weights. The weight of each sample is a
measure of its likelihood given the observed data
. I define
as a sequence of observations for the ith coefficient over
time; specifically, let
be the
sequence of observations of the horizontal velocity of the left hand, the
vertical velocity of the left hand, the horizontal velocity of the right
hand, and the vertical velocity of the right hand respectively.
Extending Black and Jepson, I then calculate the weight by the following equation:
where
and where w is the size of a temporal window that spans back in time
(here, I take w = 10). Note that
,
, and
refer to the appropriate parameters of the model for the blob in
question and that
refers
to the value given to the ith coefficient of the model
interpolated at time
and scaled by
.
With this algorithm in place, all that remains is actually classifying the
video sequence as one of the two signs. Since the whole idea of
Condensation is that the most likely hypothesis will dominate by the end,
I chose to use the criterion of which model was deemed most likely at the
end of the video sequence to determine the class of the entire video
sequence. Determining the probability assigned to each model is a simple
matter of summing the weights of each sample in the sample set at a given
moment whose state refers to the model in question. The following
graphs plot the likelihood of each model over time for an instance of each
sign (the first is a sign that is classified as model 1, the second a sign
that is classified as model 2):
Using this criterion, my system correctly classified 80 percent of the
signs it was trained on and 75 percent of novel signs.