Sign Language Recognition with HMMs


Feature Engineering, Model Design, Implementation and Results


In this article, I will demonstrate how I built a system to recognize American Sign Language video sequences using Hidden Markov Models (HMMs). The training data is from the RWTH-BOSTON-104 database and is available here. A tracking algorithm is used to determine the Cartesian coordinates of the signer’s hands and nose.

The centre of the yellow box gives the x and y location of the hands. The same procedure is applied to the nose.


American Sign Language (ASL) is the natural language of choice for most hearing impaired individuals. Signers, however, have serious problems communicating with people who do not sign. The goal of a machine recognition system is to allow real-time communication with individuals not fluent in ASL.

Why HMMs

HMMs have intrinsic properties which make them very attractive for time series models. It is unnecessary to explicitly separate words in sentences for either training or recognition. HMMs have proven very effective in the speech recognition community, so they are a natural candidate for extension to machine vision problems.

Feature Engineering

The database contains a number of videos which give the location of the left hand, right hand and nose of the signer. For example, in video 98, we have:

Data for frames 0,1,2,3,4 in video 98

The objective of feature engineering is to design features which contain highly relevant information, while also keeping the number of features as small as possible. Keeping the model as simple as possible also reduces training time. I will present 5 possible feature engineering techniques. A priori, it’s very difficult to determine which technique will be most promising. We will use a development set along with appropriate evaluation metrics to determine the effectiveness of each approach.

F1. Grounded Positions

We could ground the right and left hand positions relative to the nose position, as opposed to an arbitrary origin. This new feature would capture the difference between the hand and nose position. The computation of these new features would be as follows:

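asl.df['grnd-ry'] = asl.df['right-y'] - asl.df['nose-y']
asl.df['grnd-rx'] = asl.df['right-x'] - asl.df['nose-x']
asl.df['grnd-ly'] = asl.df['left-y'] - asl.df['nose-y']
asl.df['grnd-lx'] = asl.df['left-x'] - asl.df['nose-x']
# Feature list used when building the training set for this feature group
features_ground = ['grnd-rx', 'grnd-ry', 'grnd-lx', 'grnd-ly']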

F2. Normalized Positions

To account for speakers with different heights and arm lengths, we could use the standard score equation to normalize these differences. The computation would be as follows:

# Per-speaker mean and standard deviation of every coordinate column
df_means = asl.df.groupby('speaker').mean()
df_std = asl.df.groupby('speaker').std()

def computeScore(data, mean, std):
    # data is one row holding (speaker, value); return the speaker-specific standard score
    speaker = data.iloc[0]
    value = data.iloc[1]
    return (value - mean[speaker]) / std[speaker]

asl.df['norm-rx'] = asl.df[['speaker', 'right-x']].apply(computeScore, args=(df_means['right-x'], df_std['right-x'],), axis=1)
asl.df['norm-ry'] = asl.df[['speaker', 'right-y']].apply(computeScore, args=(df_means['right-y'], df_std['right-y'],), axis=1)
asl.df['norm-lx'] = asl.df[['speaker', 'left-x']].apply(computeScore, args=(df_means['left-x'], df_std['left-x'],), axis=1)
asl.df['norm-ly'] = asl.df[['speaker', 'left-y']].apply(computeScore, args=(df_means['left-y'], df_std['left-y'],), axis=1)
features_norm = ['norm-rx', 'norm-ry', 'norm-lx', 'norm-ly']
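
The same standard score can also be computed without a row-by-row apply. The following is a minimal sketch on a made-up frame (the speaker labels and coordinates are invented for illustration and are not taken from the database); the helper name zscore_by_speaker is hypothetical, and it uses the sample standard deviation that pandas computes by default.

```python
import pandas as pd

# Toy stand-in for asl.df: two hypothetical speakers with different arm spans
df = pd.DataFrame({
    'speaker': ['a', 'a', 'a', 'b', 'b', 'b'],
    'right-x': [10.0, 20.0, 30.0, 100.0, 200.0, 300.0],
})

def zscore_by_speaker(frame, column):
    """Return the per-speaker standard score of one column."""
    grouped = frame.groupby('speaker')[column]
    # transform broadcasts each speaker's mean/std back onto that speaker's rows
    return (frame[column] - grouped.transform('mean')) / grouped.transform('std')

df['norm-rx'] = zscore_by_speaker(df, 'right-x')
print(df)
```

Both toy speakers end up on the same scale even though their raw coordinates differ by a factor of ten, which is the point of normalizing per speaker.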

F3. Polar Coordinates

Another possible representation of the features would be to use polar coordinates. We swap the x and y axes so that the angular discontinuity (the point where theta wraps around) lies at 12 o’clock instead of along the horizontal axis. This moves the discontinuity directly above the speaker’s head, an area not generally used in signing.

import numpy as np

def computeRadius(data):
    # data is one row holding the (x, y) offset of a hand from the nose
    x = data.iloc[0]
    y = data.iloc[1]
    return np.sqrt(x**2 + y**2)

def computeTheta(data):
    x = data.iloc[0]
    y = data.iloc[1]
    x, y = y, x  # swap the axes so the discontinuity sits at 12 o'clock
    return np.arctan2(y, x)

asl.df['polar-rr'] = asl.df[['grnd-rx', 'grnd-ry']].apply(computeRadius, axis=1)
asl.df['polar-rtheta'] = asl.df[['grnd-rx', 'grnd-ry']].apply(computeTheta, axis=1)
asl.df['polar-lr'] = asl.df[['grnd-lx', 'grnd-ly']].apply(computeRadius, axis=1)
asl.df['polar-ltheta'] = asl.df[['grnd-lx', 'grnd-ly']].apply(computeTheta, axis=1)
features_polar = ['polar-rr', 'polar-rtheta', 'polar-lr', 'polar-ltheta']
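
To see where the discontinuity ends up after the swap, here is a small sketch with made-up offsets. It assumes image coordinates in which y grows downward, so a negative y means the hand is above the nose; the function name to_polar_swapped is hypothetical.

```python
import numpy as np

def to_polar_swapped(x, y):
    """Polar coordinates with the arctan2 arguments swapped.

    x and y are hand offsets from the nose in image coordinates,
    where y grows downward.
    """
    r = np.hypot(x, y)
    theta = np.arctan2(x, y)  # swapped: the angle is measured from the downward y axis
    return r, theta

# Hand directly below the nose: the angle is 0
print(to_polar_swapped(0.0, 50.0))

# Hand on either side of straight above the nose: the angle jumps by about 2*pi
above_one_side = to_polar_swapped(1.0, -50.0)[1]
above_other_side = to_polar_swapped(-1.0, -50.0)[1]
print(above_one_side, above_other_side)

# Hand just below and just above the horizontal: the angle changes smoothly
beside_low = to_polar_swapped(-50.0, 1.0)[1]
beside_high = to_polar_swapped(-50.0, -1.0)[1]
print(beside_low, beside_high)
```

The only jump is the one directly above the head, so hand movements at chest and shoulder height produce a continuous theta.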

F4. Delta Differences

Yet another possibility is to use the difference in values between one frame and the next as features. The idea is that relative movements are more important than grounded positions. The first frame has no predecessor, so its delta is set to 0. The implementation is simply:

asl.df['delta-rx'] = asl.df['right-x'].diff(1)
asl.df['delta-ry'] = asl.df['right-y'].diff(1)
asl.df['delta-lx'] = asl.df['left-x'].diff(1)
asl.df['delta-ly'] = asl.df['left-y'].diff(1)
asl.df['delta-rx'] = asl.df['delta-rx'].fillna(0)
asl.df['delta-ry'] = asl.df['delta-ry'].fillna(0)
asl.df['delta-lx'] = asl.df['delta-lx'].fillna(0)
asl.df['delta-ly'] = asl.df['delta-ly'].fillna(0)
features_delta = ['delta-rx', 'delta-ry', 'delta-lx', 'delta-ly']
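
One thing to watch with a plain diff over the whole column is that it also subtracts the last frame of one video from the first frame of the next. A possible variant, sketched here on invented numbers and assuming the frame table is indexed by (video, frame), takes the difference within each video only.

```python
import pandas as pd

# Toy stand-in for asl.df, assumed to be indexed by (video, frame); values are made up
index = pd.MultiIndex.from_tuples(
    [(1, 0), (1, 1), (1, 2), (2, 0), (2, 1)], names=['video', 'frame'])
df = pd.DataFrame({'right-x': [149, 149, 152, 60, 63]}, index=index)

# Difference within each video only; the first frame of every video gets 0
df['delta-rx'] = df.groupby(level='video')['right-x'].diff().fillna(0)
print(df)
```

With a plain diff the first frame of video 2 would receive 60 - 152 = -92, a movement that never happened.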

F5. Normalized Polar Coordinates

Finally, we combine elements of normalization and polar coordinates in the obvious way. The basic idea here is that polar coordinates can move the discontinuity to a more favourable location and normalization scales the values appropriately.

df_means = asl.df.groupby('speaker').mean()
df_std = asl.df.groupby('speaker').std()
asl.df['norm-polar-rr'] = asl.df[['speaker', 'polar-rr']].apply(computeScore, args=(df_means['polar-rr'], df_std['polar-rr'],), axis=1)
asl.df['norm-polar-rtheta'] = asl.df[['speaker', 'polar-rtheta']].apply(computeScore, args=(df_means['polar-rtheta'], df_std['polar-rtheta'],), axis=1) 
asl.df['norm-polar-lr'] = asl.df[['speaker', 'polar-lr']].apply(computeScore, args=(df_means['polar-lr'], df_std['polar-lr'],), axis=1) 
asl.df['norm-polar-ltheta'] = asl.df[['speaker', 'polar-ltheta']].apply(computeScore, args=(df_means['polar-ltheta'], df_std['polar-ltheta'],), axis=1)
# Feature list used to build the training set for the custom features
features_custom = ['norm-polar-rr', 'norm-polar-rtheta', 'norm-polar-lr', 'norm-polar-ltheta']

Model Selection

The objective here is to choose the appropriate number of hidden states in the HMM for each word. The evaluation metrics we will consider are:

  • Log likelihood using cross-validation folds (CV)
  • Bayesian Information Criterion (BIC)
  • Discriminative Information Criterion (DIC)

In order to fit an HMM to a single word we will use the hmmlearn library in Python. The fit function invokes the Baum-Welch Expectation-Maximization algorithm to iteratively find the best estimate of the model parameters for a given number of hidden states. We could train the word ‘BOOK’ as follows:

import warnings
from hmmlearn.hmm import GaussianHMM

def train_a_word(word, num_hidden_states, features):
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    # Build the training set for the chosen features and pull out this word's sequences
    training = asl.build_training(features)
    X, lengths = training.get_word_Xlengths(word)
    # Fit with Baum-Welch, then score the training data under the fitted model
    model = GaussianHMM(n_components=num_hidden_states, n_iter=1000).fit(X, lengths)
    logL = model.score(X, lengths)
    return model, logL

demoword = 'BOOK'
model, logL = train_a_word(demoword, 3, features_ground)
print("Number of states trained in model for {} is {}".format(demoword, model.n_components))
print("logL = {}".format(logL))

I will now give the basic definitions of the evaluation metrics I considered. I will also compare and contrast the statistical intuitions behind them and provide an appropriate recommendation of the most effective metric.

Log likelihood using cross-validation folds (CV)

This approach splits the training sequences for a word into folds, trains the model on all but one fold and scores the held-out fold. We repeat this for every fold and average the held-out log likelihoods. We would simply choose the number of states which maximizes this average.
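
The fold loop can be sketched in plain Python as follows. This is an illustration under assumptions, not the code used for the results: the names cv_average_score, fit and score are hypothetical, folds are formed by simple striding, and the toy "model" is just a mean so that the block runs on its own.

```python
def cv_average_score(sequences, n_splits, fit, score):
    """Average held-out score over n_splits folds.

    fit(train_sequences) returns a model; score(model, sequence) returns
    that sequence's log likelihood under the model.
    """
    if not 2 <= n_splits <= len(sequences):
        raise ValueError("need between 2 and len(sequences) folds")
    # Fold k takes every n_splits-th sequence, starting at position k
    folds = [sequences[k::n_splits] for k in range(n_splits)]
    fold_scores = []
    for k, held_out in enumerate(folds):
        train = [seq for j, fold in enumerate(folds) if j != k for seq in fold]
        model = fit(train)
        fold_scores.append(sum(score(model, seq) for seq in held_out))
    return sum(fold_scores) / len(fold_scores)

# Toy check: the "model" is the mean of the training values,
# and the score is the negative squared error on the held-out values
def fit_mean(train):
    values = [v for seq in train for v in seq]
    return sum(values) / len(values)

def neg_squared_error(model, seq):
    return -sum((v - model) ** 2 for v in seq)

toy_sequences = [[1.0], [2.0], [3.0]]
toy_average = cv_average_score(toy_sequences, 3, fit_mean, neg_squared_error)
print(toy_average)
```

With hmmlearn, fit would concatenate the training sequences and call GaussianHMM(...).fit(X, lengths), and score would call model.score on the held-out sequence. Running cv_average_score once per candidate number of states and keeping the largest average gives the selection rule described above.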

Bayesian Information Criterion (BIC)

We choose the number of states which gives the lowest Bayesian Information Criterion. The formula for BIC is -2 * logL + p * logN, where logL is the log likelihood of the fitted model, p is the number of free parameters in the model and N is the number of data points. As seen in the formula, model complexity applies a penalty in the form of p * logN.
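
As a worked example of the formula, the sketch below counts the free parameters of an HMM and plugs them into BIC, taking N to be the number of data points (frames), which is the usual definition. The parameter count assumes Gaussian emissions with diagonal covariance, and the log likelihood and frame count are made-up numbers.

```python
import math

def num_free_parameters(n_states, n_features):
    """Free parameters of an HMM with diagonal-covariance Gaussian emissions."""
    transitions = n_states * (n_states - 1)  # each row of the transition matrix sums to 1
    start_probs = n_states - 1               # the initial distribution sums to 1
    means = n_states * n_features
    variances = n_states * n_features
    return transitions + start_probs + means + variances

def bic(log_likelihood, n_params, n_points):
    """BIC = -2 * logL + p * log(N); lower is better."""
    return -2.0 * log_likelihood + n_params * math.log(n_points)

# 3 hidden states and 4 features: 6 + 2 + 12 + 12 = 32 free parameters
p = num_free_parameters(n_states=3, n_features=4)
print(p)
print(bic(log_likelihood=-100.0, n_params=p, n_points=50))
```

Adding a state raises p quadratically through the transition matrix, so a larger model has to improve logL noticeably before its BIC drops.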

Discriminative Information Criterion (DIC)

Here we choose the number of states which gives the highest DIC. The formula for DIC is log(P(original word)) - average(log(P(other words))). The idea of DIC is that we are trying to find the model that gives a high likelihood to the original word and a low likelihood to the other words. In other words, we are maximizing the difference between the log likelihood that the model assigns to the original word and the average log likelihood that it assigns to the other words.
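
The DIC formula translates directly into a few lines of Python. The log likelihoods below are invented to illustrate the selection rule; in practice each would come from scoring a trained model on the corresponding word’s sequences.

```python
def dic(log_l_target, log_l_others):
    """DIC = log P(target word) - mean of log P(other words); higher is better."""
    if not log_l_others:
        raise ValueError("need at least one competing word")
    return log_l_target - sum(log_l_others) / len(log_l_others)

# Made-up scores for models with 2, 3 and 4 states:
# (log likelihood on the target word, log likelihoods on two other words)
candidates = {
    2: (-12.0, [-40.0, -44.0]),
    3: (-10.0, [-50.0, -70.0]),
    4: (-9.0, [-20.0, -22.0]),
}
dic_scores = {n: dic(target, others) for n, (target, others) in candidates.items()}
best_n = max(dic_scores, key=dic_scores.get)
print(dic_scores)
print(best_n)
```

In this toy case the 4-state model fits the target word best, but it also fits the other words well, so the 3-state model wins on DIC.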

Which metric should I use?

Cross-validation scoring works by choosing the model with the highest average likelihood over all hold-out folds. It is not biased in the sense that it considers all possible testing groups in computing a likelihood. It is simple to interpret as well. BIC does not use a hold-out set. Instead, it guards against overfitting by penalizing model complexity directly in the formula.

BIC is a method for choosing a model with the best tradeoff between fit and complexity, and cross-validation is a method for choosing a model with the best out-of-sample predictive accuracy.

The disadvantage of BIC is that there is no guarantee that the complexity penalty will exactly offset the tendency to overfit. If you only care about model generalization, then cross-validation is the best option. On the other hand, DIC does not consider model complexity. The idea behind DIC is that you are trying to find the number of components which gives a high score to the current word and a low average score to the other words scored against this model.

It’s basically saying that you want your model to give a high probability to the word you are currently scoring, and a low probability to the other words.


If you look on Stack Overflow there are many discussions and arguments over which of these criteria is best. I would argue that cross-validation is the most effective approach due to its simplicity and unbiased scoring metric, which ultimately maximizes generalization potential.


I used a development set containing 180 sentences to test all 12 combinations of the features and evaluation metrics discussed above. The results are as follows:

Shows the word error rate (WER) of the model for each feature set and for each evaluation metric. A WER of 0 means that the model has perfect predictive power on the development set.

The best combinations were (Polar, DIC) and (Polar, BIC) because these have the lowest word error rate (WER) on words not already seen by the model. The polar coordinates are shown to have the highest predictive power out of the features I considered. In general, DIC seems to be the best overall and BIC seems to be the worst. After running the tests a few times, I decided to implement (Polar, DIC) as my final model because it was slightly better on average. Note that the average sentence length was about 5 words. If we just guessed randomly we would get the correct sentence with probability (1/500)⁵ since there are 500 words in the database. That’s effectively a WER of 1.0.
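
To make the random-guess baseline concrete, here is a small sketch. It assumes WER is measured as the fraction of word positions guessed incorrectly, which is one simple definition and may differ from the exact scoring used for the table; the example sentence is invented.

```python
def word_error_rate(reference, hypothesis):
    """Fraction of positions where the recognized word differs from the reference."""
    if len(reference) != len(hypothesis):
        raise ValueError("this sketch assumes one guess per reference word")
    errors = sum(r != h for r, h in zip(reference, hypothesis))
    return errors / len(reference)

# One wrong word out of three
example_wer = word_error_rate(['JOHN', 'BUY', 'BOOK'], ['JOHN', 'BUY', 'HOUSE'])
print(example_wer)

# Uniform random guessing over a 500-word vocabulary, 5 words per sentence
vocabulary_size = 500
sentence_length = 5
random_sentence_probability = (1 / vocabulary_size) ** sentence_length
random_expected_wer = 1 - 1 / vocabulary_size
print(random_sentence_probability)
print(random_expected_wer)
```

Under these assumptions a random guesser gets a whole sentence right with probability of roughly 3.2e-14 and has an expected WER of 0.998, which is why the baseline is quoted as a WER of about 1.0.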


We built an impressive system which converts ASL video sequences into text sentences. Feature engineering was used to maximize generalization potential. Both the quality and quantity of the features had great influence on the effectiveness of the model.

Finally, appropriate evaluation metrics were used for hyperparameter tuning. The results indicate that the combination of polar coordinates and the DIC metric generalized best.

For additional predictive power, we should implement a language model which takes the sequential nature of sentences into account. We are currently using a 1-gram model, which scores each word independently of its neighbours. We should instead use an n-gram model, where n > 1.
